# NATURAL LANGUAGE PROCESSING

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import string
import re
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Loading and preparing the files

I've found so far two ways of doing it, but one of them does not close the file, so I will put it as it is simpler, and really useful when it comes to understanding what we are doing, but also much less efficient, so I'll be using both:

### First way (Easy one):

In [None]:
#Necessary encoding, otherwise it will not work
train_file = open('../data/movie_data/full_train.txt', 'r', encoding = "utf8") 
test_file = open('../data/movie_data/full_test.txt', 'r', encoding = "utf8")

### Transforming the files into readable ones

The 'train_file' type is a '_io.TextIOWrapper' object, which can not be iterated as it has no index. However, you can do create a list of 'lines'

In [None]:
reviews_train = [review.strip() for review in train_file]
reviews_test = [review.strip() for review in test_file]

### Printing the list results

Now we will be printing the first item of the new lists created "reviews_train" and "reviews_test". being each one a complete review, so the lists should have the total number of reviews:

In [None]:
print(f'TRAINING DATA: \n\n    First paragraph:\n\n{reviews_train[0]}\
                \n\n Number of training reviews: {len(reviews_train)}')
print('\n')
print(f'TESTING DATA: \n\n    First paragraph:\n\n{reviews_test[0]}\
                \n\n Number oftesting reviews: {len(reviews_test)}')

### Second way (Efficient one):

So in this case, we will create the lists in the same cells by using the 'with' 'using statement', which will close the file after the action it is meant to perform

In [2]:
with open('../data/movie_data/full_train.txt', 'r', encoding = "utf8") as train_file:
    reviews_train = [review.strip() for review in train_file]
    
with open('../data/movie_data/full_test.txt', 'r', encoding = "utf8") as test_file:
    reviews_test = [review.strip() for review in test_file]

So we just need to print it:

In [3]:
print(f'TRAINING DATA: \n\n    First paragraph:\n\n{reviews_train[0]}\
                \n\n Number of training reviews: {len(reviews_train)}')
print('\n')
print(f'TESTING DATA: \n\n    First paragraph:\n\n{reviews_test[0]}\
                \n\n Number oftesting reviews: {len(reviews_test)}')

TRAINING DATA: 

    First paragraph:

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!                

 Number of training reviews: 25000


TESTING DATA: 

    First paragraph:

I went and saw this movie last night after being coaxed to by 

# Warning!!

### *Run the first one only, and just only to understand how the code is working, once you have done it ignore it and run the second for better performance*

## Data wrangling

### Data cleaning:


So now that we have the two lists with the reviews we want in each one, let's prepare them for the analysis:

In [4]:
def cleaning(par):
    
    '''Function to remove punctuation
    from the given paragraph and html 
    expressions that are not needed'''
    
    html = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")
    
    final = [s.translate(str.maketrans('', '', string.punctuation)).lower() for s in par]
    final_par = [html.sub(" ", line) for line in final]
    return final

Let's apply the function to the previous lists and change their names

In [5]:
clean_review_train = cleaning(reviews_train)
clean_review_test = cleaning(reviews_test)

### Preparing the vectorized matrix

Now that we have already cleaned the data, we will vectorize the data we have to fit it into the model we will create later

In [6]:
TfidVect = TfidfVectorizer()
tfv_review_train_matrix = TfidVect.fit_transform(clean_review_train)

## Training the model

So now that we have understood our model, we can start training the model. We will assign positive sentiment to the first half of the dataset and negative one to the second part (as the dataset is already prepared for that), so we will have 1's from 0 to 12500 and 0's from 12500 to 25000 and the variables for the split test

In [7]:
target = [1 if i < 12500 else 0 for i in range(25000)]

X_tr_train, X_tr_test, y_tr_train, y_tr_test = \
    train_test_split(tfv_review_train_matrix, target, train_size = 0.75)

Now, to know which is the best inverse regularization value we will fit the model and loop the values until we got the value we want (evading overfitting at the same time) 

In [8]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_tr_train, y_tr_train)
    score = accuracy_score(y_tr_test, lr.predict(X_tr_test))
    print(f"Accuracy for C = {c}: {score}")



Accuracy for C = 0.01: 0.79376
Accuracy for C = 0.05: 0.82656
Accuracy for C = 0.25: 0.86688
Accuracy for C = 0.5: 0.87552
Accuracy for C = 1: 0.88368


And finally, let's try the model with the test dataset we have so we can see whether it is working or not

In [9]:
tfv_review_test_matrix = TfidVect.transform(clean_review_test)
final_model = LogisticRegression(C=1)
final_model.fit(tfv_review_train_matrix, target)
accuracy = accuracy_score(target, final_model.predict(tfv_review_test_matrix))
print(f"Accuracy of the model = {accuracy}")

Accuracy of the model = 0.88336


So we can see here that we can put whatever phrase we want to try and the prediction will be applied

In [10]:
test_phrase = ["pirate ships and go from that shelf.com another night here in Canton this has been an incredible day for someone to talk about it is Once Upon a Time in Hollywood now the title of Tarantino's latest come should give you a hint that this is not a straightforward story this is a fable this is one of his historical fantasy does it work where he's going to take the storyline that we know and twisted ever-so-slightly or in some ways a major ways this is a easy film to spoil so I'm not going to do that but what I will say is that this has some extraordinary moments and some incredible performances this film just incredible incredible power house and again who's playing Sharon Tate in the butt injection she just absolutely bring much more than a social reasons and television was eating into the holy thoughts and hope of the hippie generation and the sort of Darkness underneath that it's all of these things that I've only had a few hours", "what's up guys don't have to do this in a much more like formal environment where I would you know such a camera and mic myself and all that stuff but tonight just because I wanted to get it down and dirty done fast and give you guys the goods on Once Upon a Time in Hollywood I thought I would just hold the phone by and talk to you all and just post a review so I saw once upon a time in Hollywood is a non-spoiler review"]
trial = TfidVect.transform(test_phrase)
lr.predict(trial)

array([1, 0])