# PES University, Bangalore
Established under Karnataka Act No. 16 of 2013

## UE20CS312 - Data Analytics - Worksheet 4a
Course instructor: Gowri Srinivasa, Professor Dept. of CSE, PES University  
Designed by Harshith Mohan Kumar, Dept. of CSE - harshithmohankumar@pesu.pes.edu  
  
Renita Kurian - PES1UG20CS331

### Collaborative & Content based filtering
The **Collaborative filtering method** for recommender systems relies solely on the past interactions recorded between users and items to produce new recommendations.

The **Content-based** approach uses additional information about users and/or items. The Content-based approach requires a good amount of information about items’ features, rather than using the user’s interactions and feedback.
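
To make the contrast concrete, here is a minimal sketch on made-up data. The ratings matrix, the feature matrix and the `cosine` helper are all hypothetical and are not part of the worksheet dataset. Collaborative filtering compares items through the ratings users gave them, while the content-based approach compares items through their own features.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings: rows = users, columns = items A, B, C (0 = not rated)
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
])

# Hypothetical item features: columns = [dragons, politics, comedy]
features = np.array([
    [1, 1, 0],  # item A
    [0, 0, 1],  # item B
    [1, 1, 0],  # item C
])

# Collaborative: items are similar when the same users rate them alike
collab_ab = cosine(ratings[:, 0], ratings[:, 1])
collab_ac = cosine(ratings[:, 0], ratings[:, 2])

# Content-based: items are similar when their feature vectors match
content_ab = cosine(features[0], features[1])
content_ac = cosine(features[0], features[2])

print("collaborative:", round(collab_ab, 3), round(collab_ac, 3))
print("content-based:", round(content_ab, 3), round(content_ac, 3))
```

In this toy setup the two views disagree: users rate A and B alike, yet A shares its features with C, so each method would recommend a different item to a fan of A.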

### Prerequisites
- Revise the following concepts
    - TF-IDF
    - Content-based filtering
    - Cosine Similarity
- Install the following software
    - pandas
    - numpy
    - scikit-learn (imported as `sklearn`)

### Task
After the disastrous pitfall of [Game of Thrones season 8](https://en.wikipedia.org/wiki/Game_of_Thrones_(season_8)), George R. R. Martin set out to fix mindless mistakes caused by the producers David and Daniel.

A few years down the line, we now are witnessing George R. R. Martin's latest work: [House of the Dragon](https://www.hotstar.com/in/tv/house-of-the-dragon/1260110208). This series is a story of the Targaryen civil war that took place about 200 years before events portrayed in Game of Thrones.

In this notebook you will be exploring and analyzing tweets related to the House of the Dragon TV series. First we will convert the tokenized text into TF-IDF vectors. Then we will find the top-k most similar tweets using cosine similarity between those vectors.

The dataset has been extracted using the [Twitter API](https://developer.twitter.com/en/docs/twitter-api) by utilizing a specific search query. The data has been extensively preprocessed and a small subset has been stored in the file `twitter_HOTD_DA_WORKSHEET4A.csv`.

**Note:** This notebook may contain spoilers to the show.

### Data Dictionary
**author_id**: A unique identifier assigned to the Twitter user.

**tweet_id**: A unique identifier assigned to the tweet.

**text**: The text associated with the tweet.

**retweet_count**: The number of retweets for this particular tweet.

**reply_count**: The number of replies for this particular tweet.

**like_count**: The number of likes for this particular tweet.

**quote_count**: The number of quotes for this particular tweet.

**tokens**: List of word tokens extracted from `text`.

**hashtags**: List of hashtags extracted from `text`.

### Points
The problems in this worksheet are for a total of 10 points with each problem having a different weightage.
- Problem 1: 4 points
- Problem 2: 4 points
- Problem 3: 2 points

### Loading the dataset

In [20]:
# Import pandas
import pandas as pd
# Use pandas read_csv function to load csv as DataFrame
df = pd.read_csv('./twitter_HOTD_DA_WORKSHEET4A.csv')
#df.head()

### Problem 1 (4 points)

Vectorize the string representations provided in the **tokens** column of the DataFrame using TF-IDF from sklearn. Then print out the TF-IDF of the first row of the DataFrame.

Solution Steps:
1. Initialize the `TfidfVectorizer()`
2. Use the `.fit_transform()` method on the entire text
3. `.transform()` the Text
4. Print number of samples and features using `.shape`
5. Print the TF-IDF of the first row

For further reference: https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/

In [10]:
# Imports
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

In [21]:
# Convert string representation of a list into a list of strings
import ast
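text = []
for r in df['tokens']:
    # Parse the stringified list, e.g. "['a', 'b']", into a real list
    res = ast.literal_eval(r)
    # Join the tokens into one lowercase, space-separated string
    joined = ' '.join(res).lower()
    # Flag rows whose token list is empty
    if joined == '':
        print(r)
    text.append(joined)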
# Print the end result
#text

In [22]:
v=TfidfVectorizer()
wc_vec=v.fit_transform(text)

In [23]:
wc_vec.shape

(8061, 10950)

In [11]:
tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(wc_vec)

TfidfTransformer()

In [24]:
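# Build a table of IDF weights, one row per vocabulary term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=v.get_feature_names_out(), columns=["tokens"])
# Sort by IDF weight, ascending
final = df_idf.sort_values(by=['tokens'])
# Print the IDF weight of each token in the first row
row1_list = text[0].split()
for i in row1_list:
    print(i, final.loc[i, 'tokens'])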

perform 9.301769763117166
duty 7.7976923663408915
mother 6.233716827983549
betrothed 9.301769763117166
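
Note that the values printed above are IDF weights taken from `tfidf_transformer.idf_`, not per-tweet TF-IDF scores. With `smooth_idf=True`, scikit-learn computes the weight of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below checks this formula against the printed numbers. The document frequencies used (1, 8 and 42) are assumptions inferred from the output, not values read from the dataset.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 8061  # number of rows reported by wc_vec.shape

# Assumed document frequencies, inferred from the printed IDF values
print(smooth_idf(n_docs, 1))   # about 9.3018, as printed for "perform" and "betrothed"
print(smooth_idf(n_docs, 8))   # about 7.7977, as printed for "duty"
print(smooth_idf(n_docs, 42))  # about 6.2337, as printed for "mother"
```

Rare terms therefore receive the largest weights: a token that appears in a single tweet gets the maximum IDF possible for this corpus.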




### Problem 2 (4 points)

Find the top-5 most similar tweets to the tweet with index `7558` using cosine similarity between the TF-IDF vectors.

Solution Steps:
1. Import `cosine_similarity` from sklearn.metrics.pairwise
2. Compute `cosine_similarity` using text_tf with index `7558` and all other rows
3. Use `argsort` to sort the cosine_similarity results
4. Print indices of top-5 most similar results from sorted array (hint: argsort sorts in ascending order)
5. Display text of top-5 most similar results using `df.iloc[index]`
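
The steps above can be sketched on a toy corpus as follows. `toy_text`, `query_idx` and `k` are hypothetical names chosen for illustration. The sketch compares only the query row with all rows, which avoids building the full n × n similarity matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; row 0 plays the role of the query tweet
toy_text = [
    "viserys build lego set",      # 0: query
    "daemon ride dragon",          # 1
    "viserys lego set peace",      # 2
    "rhaenyra ride dragon syrax",  # 3
    "king viserys build model",    # 4
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_text)

query_idx, k = 0, 2
# Compare the query row with every row: the result has shape (1, n_samples)
sims = cosine_similarity(toy_tfidf[query_idx], toy_tfidf).ravel()
# argsort is ascending, so reverse it to put the most similar rows first
order = np.argsort(sims)[::-1]
# Skip the query itself and keep the k best matches
top_k = [int(i) for i in order if i != query_idx][:k]
print(top_k)
```

Filtering out `query_idx` by index, rather than dropping the first sorted element, keeps the result correct even when another row is an exact duplicate of the query.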

In [13]:
# Print out the tokens from index `7558`
print(text[7558])
# Print out the text from index `7558`
print(df.iloc[7558]['text'])

viserys wanna build lego set mind business let man live peace
rt viserys just wanna build his lego set and mind his business . let that man live in peace #houseofthedragon


In [14]:
from sklearn.metrics.pairwise import cosine_similarity 

In [27]:
tfidf = TfidfVectorizer().fit_transform(text)
cos_similarity = cosine_similarity(tfidf, tfidf)
print(cos_similarity)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [28]:
def get_recommendations(idx, cos_s, k=5):
    # Pair every row index with its similarity to row `idx`
    scores = list(enumerate(cos_s[idx]))
    # Sort by similarity, highest first
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    # Drop the query tweet itself, then keep the k best matches
    indices = [i for i, _ in scores if i != idx][:k]
    return df.iloc[indices]['text']

In [29]:
print(get_recommendations(7558, cos_similarity))

7595    viserys just wanna build his lego set and mind...
3656    rt my man's literally just trying to build a l...
6548    rt daemon and rhaenyra are never letting this ...
7705    mom said it's my turn on the valyrian lego set...
3534    rt don't play with them. they're here for busi...
Name: text, dtype: object


### Problem 3 (2 points)

A great disadvantage of TF-IDF is that it cannot capture semantics. If you had to classify tweets as positive or negative, what technique would you use to map words to vectors? Briefly provide the sequence of solution steps to solve this task. Note: Assume sentiment labels have been provided. 

(Hint: take a look at how I've provided solution steps in previous problems)

Solution Steps:
  1. Remove stop words  
  2. Lemmatize
  3. Word embedding: map each word to a vector of real numbers. The Continuous Bag of Words (CBOW) model, which learns embeddings by predicting a word from its context, can be used.
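
The steps above end at the embedding stage. One possible way to finish the task, sketched here under stated assumptions, is to average the word vectors of each tweet and train a classifier on the provided labels. The 2-D `embeddings` dictionary, the example tweets and the `tweet_vector` helper are all hypothetical; in practice the vectors would come from a trained embedding model such as CBOW.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-D embeddings; real ones would come from a trained
# model such as CBOW and have far more dimensions
embeddings = {
    "love": [1.0, 0.2], "great": [0.9, 0.1],
    "hate": [-1.0, 0.1], "boring": [-0.8, 0.3],
    "dragon": [0.0, 1.0], "episode": [0.1, 0.9],
}

def tweet_vector(tokens):
    """Average the embeddings of the known tokens in one tweet."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(2)
    return np.mean(vectors, axis=0)

# Hypothetical labelled tweets (1 = positive, 0 = negative)
tweets = [["love", "dragon"], ["great", "episode"],
          ["hate", "episode"], ["boring", "dragon"]]
labels = [1, 1, 0, 0]

# One fixed-length feature vector per tweet
X = np.array([tweet_vector(t) for t in tweets])
clf = LogisticRegression().fit(X, labels)

# Classify an unseen tweet built from known words
prediction = clf.predict([tweet_vector(["love", "episode"])])[0]
print(prediction)
```

Averaging discards word order, so this is a simple baseline rather than a complete sentiment model.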