DSCI 510 Fall 2020 Final Project Submission

1.	**My name**

    Katie Chak

2.	**Any major “gotchas” to the code (i.e. things that don’t work, go slowly, could be improved, etc.)**

    The free version of the Paralleldots emotion detection API (the one I am using for this project) allows 1,000 hits per day and 20 hits per minute. I run the emotion detection API on the following:
    
    1. Plot summaries of 41 Rick and Morty episodes (from rickandmorty.fandom.com)
    2. 1 summary each for ~300 cannabis strains (from wikileaf.com)
    
    This adds up to about 350 hits for each full run of the program, so the 1,000-hit daily limit allows only two full runs per day (a third run would exceed it).
    
    The program takes names from two APIs (Rick and Morty and Marijuana Strain) and looks them up on two websites; the content of those websites is then analyzed by the Paralleldots API. The whole program takes ~12 minutes to complete, so please be patient while it runs.
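The per-minute limit described above can be respected by pausing between calls. The sketch below is a hypothetical helper and not part of the submitted program: `call_with_pacing` and its parameters are names introduced here, and `analyze` stands in for whatever function actually sends one text to the emotion API.

```python
import time

HITS_PER_MINUTE = 20  # free-tier per-minute limit quoted above


def call_with_pacing(analyze, texts, per_minute=HITS_PER_MINUTE, sleep=time.sleep):
    """Call analyze(text) for each text, pausing between consecutive calls.

    A fixed pause of 60 / per_minute seconds between calls keeps the
    request rate at or below per_minute calls per minute.
    """
    pause = 60.0 / per_minute
    results = []
    for index, text in enumerate(texts):
        if index > 0:
            sleep(pause)  # wait before every call except the first
        results.append(analyze(text))
    return results
```

Passing `sleep` as a parameter is only a convenience for testing; in normal use the default `time.sleep` applies, giving a 3-second pause at 20 hits per minute.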

3.	**Any libraries that need to be installed to run your code (see above)**
    
    Some external packages that need to be installed to run my program are BeautifulSoup and paralleldots. All the environment requirements are included in requirements.txt and can be installed with pip install -r requirements.txt.

4.	**Anything else you feel is relevant to the grading of your project.**
    
    CHAK_KATIE_proj2.py --source remote must be run on the command line first, before running the code below for my project. This file is in the /src folder and takes about 12 minutes to run.



5.	**What did you set out to study?  (i.e. what was the point of your project?  This should be close to your Milestone 1 assignment, but if you switched gears or changed things, note it here.)**
    
    My project objective is to match episodes of Rick and Morty with marijuana strains by the emotions that they evoke. I used text emotion analysis on plot summaries to get the emotions of Rick and Morty episodes. For each strain, I used the description of the strain on Wikileaf.com.
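As a rough illustration of the matching idea described above, the sketch below ranks an episode's emotions by score and picks the strain that scores highest on the episode's strongest emotion, using the second emotion as a tie-breaker. It is a simplified stand-in, not the SQL-based logic used in the notebook cells; the function names, strain names and scores are made up for the example.

```python
def top_two(scores):
    """Return the names of the two highest-scoring emotions, strongest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked[1][0]


def best_strain(episode_scores, strains):
    """Pick the strain that scores highest on the episode's top two emotions.

    strains maps a strain name to its emotion-score dictionary. Strains are
    compared on the episode's top emotion first, then on its second emotion.
    """
    first, second = top_two(episode_scores)
    return max(strains, key=lambda name: (strains[name][first], strains[name][second]))


# Made-up scores, for illustration only
episode = {"happy": 0.10, "sad": 0.40, "fear": 0.30, "angry": 0.05, "bored": 0.05, "excited": 0.10}
strains = {
    "strain_a": {"happy": 0.50, "sad": 0.10, "fear": 0.10, "angry": 0.10, "bored": 0.10, "excited": 0.10},
    "strain_b": {"happy": 0.10, "sad": 0.45, "fear": 0.25, "angry": 0.10, "bored": 0.05, "excited": 0.05},
}
print(top_two(episode))               # ('sad', 'fear')
print(best_strain(episode, strains))  # strain_b
```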

6.	**What did you Discover/what were your conclusions (i.e. what were your findings?  Were your original assumptions confirmed, etc.?)**
    
    I was able to successfully match strains with episodes and vice versa. See code below.


7.	**What difficulties did you have in completing the project?**

    
    Many of the websites are dynamic, so I made a point of choosing more stable websites whose content is not updated frequently. Because my program takes a while to run, every test meant a long wait, which was time-consuming.

8.	**What skills did you wish you had while you were doing the project?**

    I wish I had known more about dynamic web scraping so that I could use the contents of websites more effectively.

9.	**What would you do “next” to expand or augment the project?**

    I would want to match episodes with strains on more criteria. I could go to different Rick and Morty forums and get more input for each episode in order to produce more accurate emotion analysis. I could also go through more strain reviews on marijuana websites to get more data points.



In [46]:
import requests
import json
import bs4
from bs4 import BeautifulSoup
import paralleldots
import sqlite3
import time
import csv
import argparse
import numpy as np
import pandas as pd

In [47]:
def create_connection():
    conn=sqlite3.connect('510FinalProject.db')
    cur=conn.cursor()
    return (conn,cur)

In [48]:
# this block will match chosen episode with strains
conn, cur=create_connection()
match_dict={}
for num in range (1,42): #41 episodes in total
    # selects the row of each episode, with 6 emotions and their emotion names columns on each row
    cur.execute(f"SELECT * FROM episode_emotion_table WHERE episode_id={num}")
    emotionlist=cur.fetchall()[0]
    emotiondic={}
    emotionpos=[1,3,5,7,9,11] # these are positions in emotionlist, they each correspond to names of the 6 emotions
    for position in emotionpos:
        emotiondic[emotionlist[position]]=emotionlist[position+1]
    # sort the emotion dictionary in descending order to get the most salient emotion
    sort_emotion = sorted(emotiondic.items(), key=lambda x: x[1], reverse=True)
    # this stores the most and second most salient emotion for the episode
    top_episode_emo=sort_emotion[0][0]
    top_emo_score=sort_emotion[0][1]
    # the table will sort first by the top emotion then the second most salient emotion
    second_emo=sort_emotion[1][0] 
   

    # selects the strain that best matches the episode: keep the 100 strains ranked highest
    # on the episode's top emotion, then take the one ranked highest on the second emotion
    cur.execute(f"SELECT * FROM (SELECT * FROM strain_review_table ORDER BY {top_episode_emo} DESC LIMIT 100) ORDER BY {second_emo} DESC LIMIT 1") 
    # the same row provides both the strain ID (column 0) and the strain name (column 1),
    # so the query only needs to run once
    matchedrow=cur.fetchall()[0]
    strainid=matchedrow[0]
    matchedstrain=matchedrow[1]
    
    
    # get top and second emotions of that strain
    cur.execute("SELECT * FROM strain_review_table WHERE strain_id = ?", (strainid,))
    emotionlist=cur.fetchall()[0]
    # start from an empty dictionary so the episode's scores cannot leak into the strain's ranking
    emotiondic={}
    strainpos=[2,4,6,8,10,12] # positions of the 6 emotion names in the strain row; each score follows its name
    for position in strainpos:
        emotiondic[emotionlist[position]]=emotionlist[position+1]
    # sort the emotion dictionary in descending order to get the most salient emotion
    sort_emotion1 = sorted(emotiondic.items(), key=lambda x: x[1], reverse=True)
    top_strain_emo1=sort_emotion1[0][0]
    second_strain_emo=sort_emotion1[1][0]
  
    # gets the episode's name from episode_table
    cur.execute(f"SELECT * FROM episode_table WHERE episode_id={num}")
    episode_name=cur.fetchall()[0][1]
    
    # delete a strain if it already matched with an episode, to keep things more fun ;D
    cur.execute(f"DELETE FROM strain_review_table where strain_id = {strainid}")
    
    # adds each episode's information and matched strain into dictionary
    record={
        "episode_num":num,
        "episode_name":episode_name,
        "top_episode_emotion":top_episode_emo,
        "second_episode_emotion":second_emo,
        "matched_strain":matchedstrain,
        "strain_top_emotion":top_strain_emo1,
        "strain_second_emotion":second_strain_emo,
    }
    for key, value in record.items():
        # setdefault creates the list the first time a key is seen, so no if/else is needed per column
        match_dict.setdefault(key, []).append(value)
    
conn.commit()
conn.close()
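
To see what the nested ORDER BY ... LIMIT query in the cell above selects, it can be tried on a small in-memory table. The table name, columns and scores below are invented for the illustration and are not the project's actual schema, and the sketch assumes each emotion column holds a numeric score. The inner query keeps the rows scoring highest on the first emotion, and the outer query then picks, among those, the row scoring highest on the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical, simplified schema: one numeric score column per emotion
cur.execute("CREATE TABLE toy_strains (strain_id INTEGER PRIMARY KEY, name TEXT, sad REAL, fear REAL)")
cur.executemany(
    "INSERT INTO toy_strains (name, sad, fear) VALUES (?, ?, ?)",
    [("alpha", 0.9, 0.1), ("beta", 0.8, 0.6), ("gamma", 0.2, 0.9)],
)


def pick(first, second, keep):
    """Keep the `keep` best rows by `first`, then return the best of those by `second`."""
    # Column names cannot be bound as SQL parameters, so they are checked against a whitelist
    allowed = {"sad", "fear"}
    if first not in allowed or second not in allowed:
        raise ValueError("unknown emotion column")
    cur.execute(
        f"SELECT name FROM (SELECT * FROM toy_strains ORDER BY {first} DESC LIMIT ?) "
        f"ORDER BY {second} DESC LIMIT 1",
        (keep,),
    )
    return cur.fetchone()[0]


print(pick("sad", "fear", keep=2))  # beta: best 'fear' among the two highest 'sad' rows
print(pick("sad", "fear", keep=3))  # gamma: with every row kept, only 'fear' decides
```

The second call shows that when the inner limit is at least as large as the table, the first emotion no longer influences the result. Under the same assumption about the columns, this would apply wherever LIMIT 100 is used on a table with fewer than 100 rows, such as the 41-row episode table.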


In [49]:
# converting the dictionary into dataframe
df = pd.DataFrame.from_dict(match_dict)
df #shows all matches

Unnamed: 0,episode_num,episode_name,top_episode_emotion,second_episode_emotion,matched_strain,strain_top_emotion,strain_second_emotion
0,1,Pilot,bored,angry,rockstar,angry,sad
1,2,Lawnmower Dog,excited,bored,cuvee,excited,happy
2,3,Anatomy Park,sad,fear,frosty,fear,sad
3,4,M. Night Shaym-Aliens!,fear,excited,cheese,excited,happy
4,5,Meeseeks and Destroy,fear,sad,cactus,sad,angry
5,6,Rick Potion #9,happy,excited,j-27,excited,happy
6,7,Raising Gazorpazorp,fear,excited,headband,excited,happy
7,8,Rixty Minutes,fear,sad,primus,sad,fear
8,9,Something Ricked This Way Comes,fear,excited,nuken,excited,happy
9,10,Close Rick-counters of the Rick Kind,angry,excited,khufu,excited,angry


In [50]:
# user can give input to what episode they are watching and program matches strain
user_episode=int(input("What episode are you watching? (enter episode number 1-41)--->"))
#get the row of the desired episode
match=df[df['episode_num']==user_episode]
if match.empty:
    # guard against numbers outside 1-41, which would otherwise raise an IndexError below
    print("The episode number you entered is not available, please enter a number from 1 to 41")
else:
    episode_name=match.iloc[0]["episode_name"]
    top_epi_emo=match.iloc[0]["top_episode_emotion"]
    sec_epi_emo=match.iloc[0]["second_episode_emotion"]
    top_strain_emo=match.iloc[0]["strain_top_emotion"]
    sec_strain_emo=match.iloc[0]["strain_second_emotion"]
    matchstrain=match.iloc[0]["matched_strain"]
    print(f"The episode that you have chosen is {episode_name}")
    print(f"The strain best matched with this episode is {matchstrain}")
    print(f"They are best matches because:\n Top emotions for episode {episode_name} are {top_epi_emo} and {sec_epi_emo} \n Top emotions for strain {matchstrain} are {top_strain_emo} and {sec_strain_emo}")
    print(f"Learn more about {matchstrain} at https://www.leafly.com/strains/{matchstrain}")

What episode are you watching? (enter episode number 1-41)--->3
The episode that you have chosen is Anatomy Park
The strain best matched with this episode is frosty
They are best matches because:
 Top emotions for episode Anatomy Park are sad and fear 
 Top emotions for strain frosty are fear and sad
Learn more about frosty at https://www.leafly.com/strains/frosty


In [51]:
# This block will match chosen strain with episodes
conn, cur=create_connection()
match_episode={}
# fetches every row of the strain table
cur.execute("SELECT * FROM strain_review_table")
wholetable=cur.fetchall()
for row in wholetable:
    
    strainname=row[1]
    # build a fresh emotion dictionary for each strain: the emotion names
    # (happy, angry, bored, fear, sad, excited) sit at positions 2,4,...,12
    # of the row and each score follows its name
    strain_emo={}
    for position in [2,4,6,8,10,12]:
        strain_emo[row[position]]=row[position+1]
    #sort each dictionary of emotions from highest to lowest scores for strains
    sort_emotion2 = sorted(strain_emo.items(), key=lambda x: x[1], reverse=True)
    #get top and second top emotions for the strain
    top_strain_emo2=sort_emotion2[0][0]
    second_strain_emo2=sort_emotion2[1][0]
    
    # gets the best matched episode number for the strain based on most and second most salient emotions
    cur.execute(f"SELECT * FROM (SELECT *FROM episode_emotion_table ORDER BY {top_strain_emo2} DESC LIMIT 100) ORDER BY {second_strain_emo2} DESC LIMIT 1")
    matched_episode_num=cur.fetchall()[0][0]
    
    cur.execute(f"SELECT * FROM episode_emotion_table WHERE episode_id = {matched_episode_num}")
    emotionlist=cur.fetchall()[0]
    emotiondic={}
    
    emotionpos=[1,3,5,7,9,11] # these are positions in emotionlist, they each correspond to names of the 6 emotions
    for position in emotionpos:
        emotiondic[emotionlist[position]]=emotionlist[position+1]
    # sort the emotion dictionary in descending order to get the most salient emotion
    sort_emotion3 = sorted(emotiondic.items(), key=lambda x: x[1], reverse=True)
    # this stores the most and second most salient emotion for the episode
    top_episode_emo2=sort_emotion3[0][0]

    # the table will sort first by the top emotion then the second most salient emotion
    second_episode_emo=sort_emotion3[1][0] 
   
    
    
    # gets the episode's name from episode_table
    cur.execute(f"SELECT * FROM episode_table WHERE episode_id={matched_episode_num}")
    episode_name2=cur.fetchall()[0][1]
    
    
    # adds each strain's information and matched episode into dictionary
    record={
        "strain_name":strainname,
        "top_strain_emotion":top_strain_emo2,
        "second_strain_emotion":second_strain_emo2,
        "episode_num":matched_episode_num,
        "episode_name":episode_name2,
        "top_episode_emo":top_episode_emo2,
        "second_episode_emo":second_episode_emo,
    }
    for key, value in record.items():
        # setdefault creates the list the first time a key is seen, so no if/else is needed per column
        match_episode.setdefault(key, []).append(value)
    

conn.commit()
conn.close()

In [52]:
df_episode = pd.DataFrame.from_dict(match_episode)
df_episode #shows all matches

Unnamed: 0,strain_name,top_strain_emotion,second_strain_emotion,episode_num,episode_name,top_episode_emo,second_episode_emo
0,african,happy,excited,2,Lawnmower Dog,excited,bored
1,alaska,happy,angry,13,Mortynight Run,angry,fear
2,alchemy,excited,happy,21,The Wedding Squanchers,happy,excited
3,alf,fear,sad,16,Get Schwifty,sad,angry
4,allkush,fear,sad,16,Get Schwifty,sad,angry
...,...,...,...,...,...,...,...
219,wreckage,excited,angry,13,Mortynight Run,angry,fear
220,xanadu,happy,excited,2,Lawnmower Dog,excited,bored
221,yumboldt,excited,happy,21,The Wedding Squanchers,happy,excited
222,yummy,excited,happy,21,The Wedding Squanchers,happy,excited


In [53]:
# user can give input to what strain of marijuana they are using and program matches Rick and Morty episode
user_strain=input("What strain are you using?")
# exact comparison, so that a partial name cannot pass this check and then fail the lookup below
if (df_episode["strain_name"]==user_strain).any():
    match=df_episode[df_episode['strain_name']==user_strain]
    num=match.iloc[0]["episode_num"]
    matchepisode=match.iloc[0]["episode_name"]
    print(f"The episode best matched with this strain is {matchepisode}")
    # start from the full title so that name is always defined
    name=matchepisode
    if num not in (19,25,32,35): # These are the episodes where the name in api and plot website are different
        name=name.replace(":","")
    name=name.replace(",","")
    name=name.split()
    episode_name="_".join(name)
    episodeurl=f'https://rickandmorty.fandom.com/wiki/{episode_name}'
    
    top_strain_emo=match.iloc[0]["top_strain_emotion"]
    sec_strain_emo=match.iloc[0]["second_strain_emotion"]
    top_epi_emo=match.iloc[0]["top_episode_emo"]
    sec_epi_emo=match.iloc[0]["second_episode_emo"]
    print(f"They are best matches because:\n Top emotions for strain {user_strain} are {top_strain_emo} and {sec_strain_emo}\n Top emotions for episode {matchepisode} are {top_epi_emo} and {sec_epi_emo}")
    print(f"Learn more about {matchepisode} at {episodeurl} or watch it on Hulu")
else:
    print("The strain you entered is not available, please try another strain")

What strain are you using?yummy
The episode best matched with this strain is The Wedding Squanchers
They are best matches because:
 Top emotions for strain yummy are excited and happy
 Top emotions for episode The Wedding Squanchers are happy and excited
Learn more about The Wedding Squanchers at https://rickandmorty.fandom.com/wiki/The_Wedding_Squanchers or watch it on Hulu
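
The wiki URL built in the cell above can also be produced by a small helper, which makes the special-cased episodes explicit. This is a sketch only: `wiki_url`, `WIKI_BASE` and `KEEP_COLON_EPISODES` are names introduced here, and treating episodes 19, 25, 32 and 35 as the ones whose colon is kept is an assumption read off the condition in the cell.

```python
WIKI_BASE = "https://rickandmorty.fandom.com/wiki/"
# Assumption: these are the episode numbers excluded from colon removal in the cell above
KEEP_COLON_EPISODES = {19, 25, 32, 35}


def wiki_url(episode_name, episode_num):
    """Build a wiki URL from an episode title.

    Commas are always removed, colons are removed unless the episode is
    special-cased, and the remaining words are joined with underscores.
    """
    name = episode_name
    if episode_num not in KEEP_COLON_EPISODES:
        name = name.replace(":", "")
    name = name.replace(",", "")
    return WIKI_BASE + "_".join(name.split())


print(wiki_url("The Wedding Squanchers", 21))
# https://rickandmorty.fandom.com/wiki/The_Wedding_Squanchers
```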
