1:16 
[Link](https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/implementation-removing-punctuation?u=74412268)

Steps to NLP for ML:

1) Raw text - the model cannot read raw text directly

2) Tokenizing - split the text into the units (tokens) the model should read

3) Clean text - remove stop words, punctuation, apply stemming, etc.

4) Vectorize - turn the text into numeric form

5) Machine Learning - feed the vectors into a model for train/test
    - classification model

## 1) Preprocess Data

In [1]:
import pandas as pd
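pd.set_option("display.max_colwidth", 100)  # show up to 100 characters per column when displaying tables

# header=None treats every line of the file as data, so a header line in the file (if any) is read in as row 0
my_data = pd.read_csv("Restaurant_Reviews.tsv", sep="\t", header=None)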
my_data.columns = ["Review","Positive/Negative rating"]

my_data[-5:]

Unnamed: 0,Review,Positive/Negative rating
996,I think food should have flavor and texture and both were lacking.,0
997,Appetite instantly gone.,0
998,Overall I was not impressed and would not go back.,0
999,"The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time.",0
1000,"Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing ou...",0


## 2) Remove punctuation

In [2]:
#packages

import re

In [3]:
# Define a function to remove punctuation
def remove_punctuation(text):
    # Use regular expression to remove punctuation and replace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    return text

In [4]:
# Apply the function to the 'Review' column and create a new column 'Cleaned_Review'
my_data['Cleaned_Review'] = my_data['Review'].apply(remove_punctuation)

# Save the cleaned data to a new TSV file
my_data.to_csv('my_datacleaned_file.tsv', sep='\t', index=False)

In [5]:
my_clean_data = pd.read_csv("my_datacleaned_file.tsv", sep="\t")

my_clean_data[-5:]

Unnamed: 0,Review,Positive/Negative rating,Cleaned_Review
996,I think food should have flavor and texture and both were lacking.,0,I think food should have flavor and texture and both were lacking
997,Appetite instantly gone.,0,Appetite instantly gone
998,Overall I was not impressed and would not go back.,0,Overall I was not impressed and would not go back
999,"The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time.",0,The whole experience was underwhelming and I think we ll just go to Ninja Sushi next time
1000,"Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing ou...",0,Then as if I hadn t wasted enough of my life there they poured salt in the wound by drawing ou...
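The regex above swaps every punctuation mark for a space, which is why contractions such as "we'll" show up as "we ll" in the Cleaned_Review column. As a minimal sketch (the sample sentence is made up for illustration), the snippet below compares that behaviour with deleting punctuation outright via `str.translate`, and shows one way to collapse the doubled spaces that the replacement leaves behind:

```python
import re
import string

sample = "We'll be back, it wasn't bad!"

# Same pattern as remove_punctuation: each punctuation mark becomes a space
spaced = re.sub(r"[^\w\s]", " ", sample)

# Alternative: delete ASCII punctuation instead of replacing it
deleted = sample.translate(str.maketrans("", "", string.punctuation))

# Collapse runs of whitespace and trim the ends
collapsed = " ".join(spaced.split())

print(repr(spaced))     # 'We ll be back  it wasn t bad '
print(repr(deleted))    # 'Well be back it wasnt bad'
print(repr(collapsed))  # 'We ll be back it wasn t bad'
```

Which variant is preferable depends on the later steps: replacing with a space splits contractions into two tokens, while deleting merges them into one.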


## 3) Tokenization

Splits each entry of the Cleaned_Review column into individual words (tokens). This builds up the model's vocabulary and breaks the text into smaller, manageable pieces for preprocessing.

In [6]:
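# uses re (imported above)

def data_tokenization(text):
    # split the review on runs of non-word characters; each piece is one token
    # note: a trailing space in the text leaves an empty string as the last token
    tokens = re.split(r"\W+", text)
    return tokens

# Tokenized column = lower-cased Cleaned_Review passed through the tokenizer
my_data['Tokenized'] = my_data['Cleaned_Review'].apply(lambda x: data_tokenization(x.lower()))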

my_data[-5:]

Unnamed: 0,Review,Positive/Negative rating,Cleaned_Review,Tokenized
996,I think food should have flavor and texture and both were lacking.,0,I think food should have flavor and texture and both were lacking,"[i, think, food, should, have, flavor, and, texture, and, both, were, lacking, ]"
997,Appetite instantly gone.,0,Appetite instantly gone,"[appetite, instantly, gone, ]"
998,Overall I was not impressed and would not go back.,0,Overall I was not impressed and would not go back,"[overall, i, was, not, impressed, and, would, not, go, back, ]"
999,"The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time.",0,The whole experience was underwhelming and I think we ll just go to Ninja Sushi next time,"[the, whole, experience, was, underwhelming, and, i, think, we, ll, just, go, to, ninja, sushi, ..."
1000,"Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing ou...",0,Then as if I hadn t wasted enough of my life there they poured salt in the wound by drawing ou...,"[then, as, if, i, hadn, t, wasted, enough, of, my, life, there, they, poured, salt, in, the, wou..."
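The empty entry at the end of the token lists above comes from the trailing space left where the final punctuation mark used to be. A small sketch of two ways to avoid it, using a made-up input string; neither is necessarily what the course uses:

```python
import re

text = "appetite instantly gone "  # trailing space, as left by the punctuation step

# Splitting on non-word runs leaves an empty string after the last separator
raw_tokens = re.split(r"\W+", text)

# Option 1: filter out empty strings after splitting
filtered_tokens = [token for token in raw_tokens if token]

# Option 2: match the words directly instead of splitting on the gaps
matched_tokens = re.findall(r"\w+", text)

print(raw_tokens)       # ['appetite', 'instantly', 'gone', '']
print(filtered_tokens)  # ['appetite', 'instantly', 'gone']
print(matched_tokens)   # ['appetite', 'instantly', 'gone']
```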


## Test case sensitivity NLP

The purpose of the .lower() method is to turn all capital letters into lower case so that the same word is always represented by the same string.

To us, SONIC is the same as sonic, but since Python string comparison is case sensitive, it sees the two words as different things.

In [7]:
"SONIC" == "sonic"

False
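To illustrate the effect on a vocabulary, here is a small sketch with a made-up token list. It counts tokens before and after lower-casing:

```python
from collections import Counter

tokens = ["SONIC", "Sonic", "sonic", "burger"]

# Without normalization, every spelling is its own vocabulary entry
raw_counts = Counter(tokens)

# After lower-casing, the three spellings collapse into one entry
lowered_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts))        # 4
print(dict(lowered_counts))   # {'sonic': 3, 'burger': 1}
```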

## 4) Remove stopwords

In [8]:
import nltk

# needs the NLTK stopword corpus; run nltk.download("stopwords") once if it is not installed yet
stopword = nltk.corpus.stopwords.words("english")

In [9]:
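def remove_stopwords(tokenized_list):
    # keep only the tokens that are not in the NLTK stopword list
    # note: the braces build a set, so duplicates and word order are dropped; use [...] to keep a list
    my_text = {word for word in tokenized_list if word not in stopword}
    return my_text

my_data['Remove_stop'] = my_data['Tokenized'].apply(remove_stopwords)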

my_data[-5:]

Unnamed: 0,Review,Positive/Negative rating,Cleaned_Review,Tokenized,Remove_stop
996,I think food should have flavor and texture and both were lacking.,0,I think food should have flavor and texture and both were lacking,"[i, think, food, should, have, flavor, and, texture, and, both, were, lacking, ]","{, texture, flavor, lacking, food, think}"
997,Appetite instantly gone.,0,Appetite instantly gone,"[appetite, instantly, gone, ]","{appetite, , gone, instantly}"
998,Overall I was not impressed and would not go back.,0,Overall I was not impressed and would not go back,"[overall, i, was, not, impressed, and, would, not, go, back, ]","{, go, would, back, impressed, overall}"
999,"The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time.",0,The whole experience was underwhelming and I think we ll just go to Ninja Sushi next time,"[the, whole, experience, was, underwhelming, and, i, think, we, ll, just, go, to, ninja, sushi, ...","{, go, ninja, next, time, whole, sushi, experience, underwhelming, think}"
1000,"Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing ou...",0,Then as if I hadn t wasted enough of my life there they poured salt in the wound by drawing ou...,"[then, as, if, i, hadn, t, wasted, enough, of, my, life, there, they, poured, salt, in, the, wou...","{wasted, , poured, life, enough, time, bring, wound, drawing, salt, check, took}"
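Two things stand out in the Remove_stop column above: the values are sets (unordered, with an empty-string entry), and in row 998 the word "not" has been removed along with the other stopwords. The sketch below shows a list-based variant. `STOPWORDS` is a tiny hypothetical stand-in for the NLTK list so that the snippet runs without downloading anything, and keeping "not" is an optional design choice, not something the course prescribes:

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
STOPWORDS = {"i", "was", "not", "and"}

tokens = ["overall", "i", "was", "not", "impressed", "and",
          "would", "not", "go", "back", ""]

# Set comprehension, as in the cell above: order and duplicates are lost
as_set = {word for word in tokens if word not in STOPWORDS}

# List comprehension: keeps order and duplicates, and skips empty tokens
as_list = [word for word in tokens if word and word not in STOPWORDS]

# Optional: keep negations, which can carry the sentiment of a review
stopwords_without_not = STOPWORDS - {"not"}
with_negation = [word for word in tokens
                 if word and word not in stopwords_without_not]

print(as_list)        # ['overall', 'impressed', 'would', 'go', 'back']
print(with_negation)  # ['overall', 'not', 'impressed', 'would', 'not', 'go', 'back']
```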


## 5) Split into Train/Test

In [10]:
from sklearn.model_selection import train_test_split

# features: punctuation-free text, tokenized version, and stopword-free version; label: the rating
# test_size=.2 holds out 20% of the reviews to test whether the model categorizes each review correctly

X_train, X_test, Y_train, Y_test = train_test_split(
    my_data[["Cleaned_Review","Tokenized","Remove_stop"]], my_data["Positive/Negative rating"], test_size=.2)
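One detail worth knowing about the split: `train_test_split` shuffles the rows but keeps their original index labels. The sketch below uses a small made-up DataFrame; `random_state` and `stratify` are optional arguments added here for reproducibility and balanced classes, and the single-column `features` frame is a placeholder for vectorizer output. Resetting the index before `pd.concat` makes the rows line up by position:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: ten reviews with alternating labels
toy = pd.DataFrame({
    "Cleaned_Review": [f"review {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["Cleaned_Review"]], toy["label"],
    test_size=0.2, random_state=42, stratify=toy["label"],
)

# The index still holds the original row labels, in shuffled order
print(X_train.index.tolist())

# Renumber the rows 0..n-1 so they align with a freshly built DataFrame
X_train_reset = X_train.reset_index(drop=True)
features = pd.DataFrame([[0.0]] * len(X_train_reset))  # placeholder feature column
combined = pd.concat([X_train_reset, features], axis=1)

print(combined.shape)
```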

## 6) Vectorize data

Turn the text into numeric features that the model can process.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

In [12]:
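# analyzer accepts 'char', 'char_wb', 'word', or a callable
# note: remove_punctuation returns a string, and the vectorizer iterates over whatever the analyzer returns,
# so each feature here is a single character rather than a word (hence the small number of columns below)
data_vector = TfidfVectorizer(analyzer=remove_punctuation)
data_vector_fit = data_vector.fit(X_train["Cleaned_Review"])

# 9/12: the earlier parameter error was fixed by passing the punctuation-removal function as the analyzer

data_train = data_vector_fit.transform(X_train["Cleaned_Review"])
data_test = data_vector_fit.transform(X_test["Cleaned_Review"])

# pd.concat accepts a list of objects and joins them on their indices
# turn each sparse matrix into a DataFrame and place it side by side with the text columns
# note: X_train keeps its shuffled original index while the new DataFrame is numbered 0..n-1,
# so rows are matched by index label, not by position
X_train_vect = pd.concat([X_train[["Tokenized","Remove_stop"]],
           pd.DataFrame(data_train.toarray())], axis=1)

X_test_vect = pd.concat([X_test[["Tokenized","Remove_stop"]],
           pd.DataFrame(data_test.toarray())], axis=1)

X_train_vect.head()

# the numbered columns are the TF-IDF features; with this analyzer they are characters, not words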

Unnamed: 0,Tokenized,Remove_stop,0,1,2,3,4,5,6,7,...,55,56,57,58,59,60,61,62,63,64
327,"[you, won, t, be, disappointed, ]","{, disappointed}",0.544608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.24381,0.11755,0.159184,0.098502,0.233051,0.0,0.081459,0.0,0.0,0.0
423,"[furthermore, you, can, t, even, find, hours, of, operation, on, the, website, ]","{, furthermore, hours, website, even, find, operation}",0.690599,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.08588,0.289841,0.112142,0.069393,0.054727,0.0,0.114772,0.0,0.0,0.0
121,"[i, just, don, t, know, how, this, place, managed, to, served, the, blandest, food, i, have, eve...","{, managed, indian, preparing, place, ever, cuisine, blandest, know, food, eaten, served}",0.732504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.204954,0.131755,0.0,0.110405,0.087071,0.0,0.091302,0.226523,0.0,0.0
553,"[i, would, recommend, saving, room, for, this, ]","{, saving, would, recommend, room}",0.64549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.154807,0.248794,0.067383,0.0,0.065767,0.0,0.137925,0.0,0.0,0.0
611,"[, the, owners, really, really, need, to, quit, being, soooooo, cheap, let, them, wrap, my, frea...","{, quit, sandwich, one, soooooo, owners, wrap, let, really, need, freaking, two, cheap, papers}",0.661202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.356789,0.161053,0.0,0.0,0.0,0.0,0.0,0.0,0.0
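The small number of feature columns above is consistent with character-level features: when `analyzer` is a callable, `TfidfVectorizer` treats each item the callable yields as one feature, and iterating over a string yields characters. A minimal sketch with two made-up documents follows; `strip_punct` and `to_tokens` are illustrative helper names, not part of the notebook:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Great food, great service!", "Food was bland."]

def strip_punct(text):
    # Returns a string, so the vectorizer sees one feature per character
    return re.sub(r"[^\w\s]", " ", text)

def to_tokens(text):
    # Returns a list of lower-cased words, so each word is one feature
    return [token for token in re.split(r"\W+", text.lower()) if token]

char_vec = TfidfVectorizer(analyzer=strip_punct).fit(docs)
word_vec = TfidfVectorizer(analyzer=to_tokens).fit(docs)

print(sorted(char_vec.vocabulary_))  # single characters, including the space
print(sorted(word_vec.vocabulary_))  # ['bland', 'food', 'great', 'service', 'was']
```

Passing a tokenizing function (or simply leaving `analyzer` at its default of `'word'`) gives one column per word instead.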


## 7) Final Evaluation of Model

Use a classification model: Random Forest


In [13]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time
#checkpoint: https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/model-selection-data-prep?u=74412268


In [15]:
my_rf = RandomForestClassifier(n_estimators=155, max_depth=None, n_jobs=-1)
# n_jobs=-1 builds the trees in parallel on all available CPU cores

# train model
my_rf_model = my_rf.fit(X_train_vect, Y_train)
my_y_pred = my_rf_model.predict(X_test_vect)

# metrics we want from running the model
# pos_label must be one of the actual label values; 1 assumes the rating column holds integers 0/1
precision, recall, fscore, train_support = score(Y_test, my_y_pred, pos_label=1, average="binary")
# output results, rounded to 3 decimal places
print("Precision: {} / Recall: {} / Accuracy: {}".format(
    round(precision, 3), round(recall, 3), round((my_y_pred==Y_test).sum()/len(my_y_pred), 3)))

## TODO: fix how X_train_vect and X_test_vect are built in step 6 before fitting:
# the "Tokenized" and "Remove_stop" columns have string names while the vector columns have integer names,
# which triggers the TypeError below; the two text columns also hold lists/sets,
# which the model cannot use as numeric features

TypeError: Feature names are only supported if all input features have string names, but your input has ['int', 'str'] as feature name / column name types. If you want feature names to be stored and validated, you must convert them all to strings, by using X.columns = X.columns.astype(str) for example. Otherwise you can remove feature / column names from your input data, or convert them all to a non-string data type.
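One possible way around this error, sketched on made-up data rather than the restaurant reviews: feed the model only the numeric TF-IDF columns, reuse the index of the split so rows stay aligned, and convert the column names to strings as the message suggests. The toy reviews, the default word-level vectorizer, and the smaller `n_estimators` are assumptions for illustration; the notebook's own fix may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Made-up toy reviews; 1 = positive, 0 = negative
reviews = pd.DataFrame({
    "Cleaned_Review": [
        "great food and great service", "loved the tasty burger",
        "amazing place will come back", "friendly staff and good food",
        "bland food and slow service", "would not go back",
        "terrible experience very rude", "the food was cold and bad",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    reviews["Cleaned_Review"], reviews["label"],
    test_size=0.25, random_state=0, stratify=reviews["label"],
)

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(X_train)
test_matrix = vectorizer.transform(X_test)

# Numeric features only, indexed like the split, with string column names
X_train_vect = pd.DataFrame(train_matrix.toarray(), index=X_train.index)
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect = pd.DataFrame(test_matrix.toarray(), index=X_test.index)
X_test_vect.columns = X_test_vect.columns.astype(str)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_vect, y_train)
predictions = model.predict(X_test_vect)

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, predictions, pos_label=1, average="binary", zero_division=0)
accuracy = (predictions == y_test).mean()

print("Precision: {} / Recall: {} / Accuracy: {}".format(
    round(precision, 3), round(recall, 3), round(accuracy, 3)))
```

With only eight toy reviews the scores themselves are not meaningful; the point is that the model fits without the feature-name error.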