In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Py-NLP

In this assignment, we will pre-process texts, extract text features, and use them to compute text similarities and perform some basic sentiment analysis. The assignment consists of three parts: in the first part, you are given some functions that constitute your text pre-processing pipeline; in the second part, you will use these pre-processing steps to clean some texts, then vectorize the clean texts using scikit-learn, and finally use the feature vectors to find text similarity; in the third part, you will perform some basic sentiment analysis.

To complete this assignment, you need to install [spaCy](https://spacy.io/usage/models#download). You can type the following in the command prompt to install spaCy:

```
pip install -U spacy
python -m spacy download en_core_web_sm
```

To use spaCy, import it and load the language model as follows:
```python

import spacy
lang_model = spacy.load("en_core_web_sm")
```

In [22]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

In [23]:
import spacy
lang_model = spacy.load('en_core_web_sm')

## Part 1: Text Pre-Processing 

Before we carry out any analysis on texts, we need to clean the textual data and convert it into a format that is easier to work with. This pre-processing stage can include the following steps:
- removal of special and accented characters 
- expansion of the contractions
- removal of punctuation 
- lowering the text case
- removal of extra spaces
- removal of stop words
- lemmatization
- tokenization
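
A few of the steps above can be sketched with the Python standard library alone. This is only an illustration, not the code in `functions.py`: the helper names `strip_accents` and `basic_clean` are hypothetical, and contraction expansion, stop-word removal, lemmatization and tokenization are left out because they need a language model or a lookup table.

```python
import re
import string
import unicodedata

def strip_accents(text):
    """Replace accented characters with their closest ASCII equivalents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def basic_clean(text):
    """Lowercase, strip accents, drop punctuation and collapse whitespace."""
    text = strip_accents(text).lower()
    # Replace each punctuation character with a space so words stay separated
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned_example = basic_clean("\n\n  Café   prices are GREAT!!  ")
print(cleaned_example)  # cafe prices are great
```

Note that removing punctuation this way would turn `wasn't` into `wasn t`, which is one reason contractions are usually expanded before punctuation is removed.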

You are given a list of functions that support these operations. Take a look at them in `functions.py`. Once you're done, test the `normalize_text` function on the following sample corpus (or any other sentence of your choice):

In [24]:
sample_corpus = ["\n\n\n The Red Bull driver wasn't quick enough and he couldn't win the race.", 
                 "Hey that's a great news!! I think we won't mind it at all.", 
                 "@@You'll (learn) very cool%% topics!", 
                 "N.G.O. stands for non-governmental organization #service @everywhere...!!"
                ]


from functions import normalize_text


cleaned = []
for x in sample_corpus:
    
    cleaned.append(normalize_text(x, lang_model, lemmatizing=True, stop_words=True))
    
print(cleaned)


['red bull driver quick win race', 'hey great news think mind', 'learn cool topic', 'n.g.o stand non governmental organization service']


Don't forget to test your function on the given sample corpus. Note that these functions process a single text at a time, so to normalize an entire corpus you need to loop over its texts.

## Part 2: Text Similarity

In this part, you will use the cosine similarity metric to find the closest text to a given text (to gain an understanding of what cosine similarity means and how it differs from a regular distance metric, you can check the two slides on Canvas). You will use the dataset [recipes.csv](recipes.csv). The dataset consists of the steps of 1000 recipes that were extracted from [here](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions?select=RAW_recipes.csv).
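
As a small illustration of the difference, the sketch below compares cosine similarity with Euclidean distance on three made-up count vectors. The vectors and the helper names `cosine_sim` and `euclidean_dist` are assumptions for this example only. Two texts with the same word proportions but different lengths point in the same direction, so their cosine similarity is close to 1 even though their Euclidean distance is large.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_dist(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1, 2, 0]   # hypothetical word counts for a short text
long_doc = [3, 6, 0]    # same word proportions, three times longer
other_doc = [0, 0, 5]   # shares no words with the other two

print(cosine_sim(short_doc, long_doc))      # close to 1: same direction
print(euclidean_dist(short_doc, long_doc))  # about 4.47: far apart in raw counts
print(cosine_sim(short_doc, other_doc))     # 0.0: no words in common
```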

1. Load the [recipes.csv](data/recipes.csv) dataset. 
2. Clean and pre-process the steps of each recipe using the function you defined in the first part (`normalize_text`).
3. Convert the steps of each recipe into a vector using [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from scikit-learn. 
4. You now have a matrix where each row represents the text of the steps of one recipe. From scikit-learn metrics, use [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to compute the pairwise similarities between all recipes.
5. Suppose a user rated the second recipe with a high rating and you want to recommend to this user 5 similar recipes. Use the obtained cosine similarity scores to find the best 5 recipes to recommend based on the highest 5 cosine similarity measures that correspond to the second recipe. (*Hint*: you can use numpy.argsort()). We used here the steps of recipes to find a recipe similar to a given one. What other features could we also include?
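
Steps 3 to 5 can be sketched on a toy corpus as follows. The four texts below are made up for illustration and are not taken from `recipes.csv`; the variable names are likewise hypothetical. The sketch removes the query recipe from the ranking explicitly instead of assuming it is always the last index returned by `argsort`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-cleaned recipe steps (not from recipes.csv)
toy_steps = [
    "bake sausage egg crust",
    "fry sausage egg",
    "boil potato salt water",
    "mash potato butter",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_steps)  # one row per recipe
toy_sims = cosine_similarity(toy_matrix)                 # shape (4, 4)

query = 0   # index of the recipe the user liked
top_k = 2   # number of recommendations to return
order = np.argsort(toy_sims[query])[::-1]                # most similar first
recommended = [int(i) for i in order if i != query][:top_k]
print(recommended)  # recipe 1 comes first: it shares "sausage" and "egg" with recipe 0
```

Filtering out the query index is slightly more robust than slicing off the last position, because a duplicate recipe would also have similarity 1 and could take that position instead.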

We have only briefly touched on recommendation systems here; more complex recommenders can also be developed. 

In [25]:
df = pd.read_csv('recipes.csv')

df.head()

Unnamed: 0,steps
0,make a choice and proceed with recipe dependi...
1,preheat oven to 425 degrees f press dough int...
2,brown ground beef in large pot add chopped on...
3,place potatoes in a large pot of lightly salt...
4,"mix all ingredients& boil for 2 1 / 2 hours ,..."


In [26]:
df['steps'] = df['steps'].apply(normalize_text, args=[lang_model, ['~','@', '#', '$', '%', '^', '&', '*'], False, True, True])



In [62]:

from sklearn.feature_extraction.text import TfidfVectorizer


tfidf = TfidfVectorizer()

transformed_text = tfidf.fit_transform(df['steps'])

from sklearn.metrics.pairwise import cosine_similarity

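# Pairwise cosine similarities between all recipes (n_recipes x n_recipes)
transform = cosine_similarity(transformed_text)
# Indices sorted by ascending similarity to the second recipe (row index 1)
similar_to_2 = np.argsort(transform[1:2,:]).ravel()
# The last index is expected to be the recipe itself (similarity 1), so take the five before it
similar_to_2[-6:-1]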


array([307, 287, 661, 357, 744], dtype=int64)

The five most similar recipes are 307, 287, 661, 357 and 744 (listed from least to most similar).

In [48]:
df['steps'].loc[1]

'preheat oven 425 degree f press dough side 12 inch pizza pan bake 5 minute set brown cut sausage small piece whisk egg milk bowl frothy spoon sausage baked crust sprinkle cheese pour egg mixture slowly sausage cheese s p taste bake 15 20 minute egg set crust brown'

In [64]:
df['steps'].loc[744]

'remove skin sausage cut 1 1 1 2 piece add oil fry sausage brown long pink drain paper towel use favorite recipe'

## Part 3: Sentiment Analysis

In this final part, you will explore some sentiment analysis techniques. In the first question, you will use the [yelp_reviews.csv](yelp_reviews.csv) dataset, which consists of text reviews and their corresponding labels: 1 (positive review) and 0 (negative review). The dataset was downloaded from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). Your task is to train a linear SVM to perform text classification for sentiment analysis. In the second question, you will use the [music_instr_reviews.csv](data/music_instr_reviews.csv) dataset, which consists of text reviews for some musical instruments purchased from Amazon. The dataset was extracted from this [website](http://jmcauley.ucsd.edu/data/amazon/). This data comes without labels, so you will perform unsupervised sentiment analysis using [spaCyTextBlob](https://spacy.io/universe/project/spacy-textblob). 

1. Load the [yelp_reviews.csv](data/yelp_reviews.csv) dataset. The first column consists of the reviews and the second column consists of the labels. Clean the texts of the reviews. Define a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) that consists of a vectorizer (`CountVectorizer` or `TfidfVectorizer`) and a linear SVM (`SVC(kernel='linear')` or `LinearSVC()`). Perform 5-fold cross-validation ([cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)) and compute the mean of the obtained accuracies. In this question, you don't need to change the classifier. Focus instead on changing some of the pre-processing steps (for example, with or without lemmatization or stop-word removal) or on changing the type of vectorizer (or some of its arguments). Comment on the obtained performances. (Note: we usually first split the data into training and testing sets, and then apply cross-validation on the training set. For this exercise, assume the data you have is the training dataset.)

2. **Optional**. Load the  [music_instr_reviews.csv](data/music_instr_reviews.csv) dataset. Use [spaCyTextBlob](https://spacy.io/universe/project/spacy-textblob) to compute the polarity of each review. Examine the scores of some reviews of your choice. 
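
For the first question, a minimal sketch of a vectorizer plus linear SVM pipeline evaluated with `cross_val_score` might look like the following. The toy reviews and labels are invented for illustration and are not from `yelp_reviews.csv`, and the choice of `CountVectorizer` with unigrams and bigrams is just one of the variations the question suggests trying.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews (not from yelp_reviews.csv): 10 positive, 10 negative
positive = ["great food", "loved this place", "tasty and fresh", "great service",
            "loved the pizza", "fresh and great", "tasty burger", "great place",
            "loved the service", "fresh tasty food"]
negative = ["bad food", "awful service", "nasty and cold", "bad place",
            "awful pizza", "cold and bad", "nasty burger", "bad service",
            "awful place", "cold nasty food"]
toy_reviews = positive + negative
toy_labels = [1] * len(positive) + [0] * len(negative)

toy_pipe = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("svm", LinearSVC()),
])

# One accuracy value per fold; the vectorizer is refit on each training fold
toy_scores = cross_val_score(toy_pipe, toy_reviews, toy_labels, cv=5, scoring="accuracy")
print(toy_scores.mean())
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned only from the training portion of each fold, which avoids leaking information from the held-out fold.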

*Note*: sentiment scores can be used as an additional feature in your data if needed. Another way to summarize the text in the textual columns is to perform topic modeling. If interested, you can check this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9120935/pdf/fsoc-07-886498.pdf) for a review of the topic, and you can also check [BERTopic](https://spacy.io/universe/project/bertopic), a topic modeling technique listed in the spaCy Universe.

In [53]:
yelp_df = pd.read_csv('yelp_reviews.csv')

yelp_df.head()

Unnamed: 0,review,label
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [54]:
X = yelp_df['review'].apply(normalize_text, args=[lang_model, ['~','@', '#', '$', '%', '^', '&', '*'], False, True, True])


In [57]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', LinearSVC())
])


pipe.fit(X, yelp_df['label'])

In [65]:
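preds = pipe.predict(X)  # predictions on the training data (not used below)

from sklearn.model_selection import cross_validate

# Note: scoring='f1' reports the F1 score per fold; use scoring='accuracy' for the accuracies the question asks for
scores = cross_validate(pipe, X, yelp_df['label'], cv=5, scoring='f1')

In [68]:
np.mean(scores['test_score'])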


0.7804409974777994