# **Sentiment Analysis with Scikit Learn**



Sentiment analysis is a data mining application that aims to extract a writer's opinion about something and classify it as positive or negative. It is widely applied to voice-of-customer material such as product or service reviews, where it helps companies decide whether to improve a product or service, discontinue it, or maintain its current quality. The same approach extends to social questions: it can capture how people feel about the political situation in their country, whether they are satisfied with the education and health systems, or their views on any other social matter that concerns them.

There are three main approaches to sentiment analysis:

*   The machine learning-based approach
*   The lexicon-based approach
*   The hybrid approach, which combines the two approaches above

In this project I build a sentiment analysis classifier for the Greek language with the help of scikit-learn, an open source machine learning library that supports supervised and unsupervised learning and provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. I therefore followed the machine learning-based approach.










# **Dataset**

The dataset 'Athinorama_movies_dataset' that I used to train and test the classifier contains 148,795 Greek movie reviews in 15 columns; 131,108 reviews remain after the neutral reviews (label equal to 3) are dropped during preprocessing. It can be downloaded from Kaggle as a zipped CSV file and was selected because of its sufficient size.

In [None]:
# install the kaggle library
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# create the '~/.kaggle' directory (-p avoids an error if it already exists)
! mkdir -p ~/.kaggle

In [None]:
# copy 'kaggle.json' into the directory
! cp kaggle.json ~/.kaggle/


In [None]:
# restrict the file permissions so that only the owner can read the API token
! chmod 600 ~/.kaggle/kaggle.json


# **Download and Unzip the data**

In [None]:
# download the dataset
! kaggle datasets download nikosfragkis/greek-movies-dataset

Downloading greek-movies-dataset.zip to /content
 32% 9.00M/28.2M [00:00<00:00, 23.3MB/s]
100% 28.2M/28.2M [00:00<00:00, 62.4MB/s]


In [None]:
# unzip the dataset
! unzip 'greek-movies-dataset.zip'

Archive:  greek-movies-dataset.zip
  inflating: Athinorama_movies_dataset.csv  


In [None]:
# Import the required libraries
import pandas as pd
import numpy as np

In [None]:
# read the dataset
df = pd.read_csv('Athinorama_movies_dataset.csv')

In [None]:
df.head()

Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....


In [None]:
# check the shape of the matrix
df.shape

(148795, 15)

In [None]:
#import required libraries
import re
import nltk
import spacy
import string

In [None]:
#null values
df.isnull().sum().sum()

55431

In [None]:
#total values
len(df)

148795

In [None]:
#check which column has null values
df.isna().sum()

id number                0
greek title              0
original title       46290
category                 0
director                 0
movie lenght             0
movie date               0
author                9141
review date              0
review                   0
stars                    0
label                    0
mean of stars            0
number of reviews        0
url                      0
dtype: int64

# **Data preprocessing**
The reviews are preprocessed (lowercased, stripped of punctuation and stop words, and lemmatized) so that the text becomes simpler and better suited to vectorization.

In [None]:
#make all the characters lowercase
df['review_lower'] = df['review'].str.lower()
df.head()


Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url,review_lower
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,φοβερή η σύλληψη του χιούμορ από τους python''...
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,"από τις καλύτερς στιγμές των μ.π., ισάξιο μόνο..."
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κλασικό! από τις καλύτερες και ανατρεπτικότερε...
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,μου θυμιζει τα κόμικ του αστερίξ που όσο μεγάλ...
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,το κάτι άλλο!


In [None]:
# remove special characters (raw string avoids invalid-escape warnings for \w and \s)
df['review_lower'] = df['review_lower'].str.replace(r'[^\w\s]', '', regex=True)
df.head()


Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url,review_lower
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,φοβερή η σύλληψη του χιούμορ από τους pythons ...
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,από τις καλύτερς στιγμές των μπ ισάξιο μόνο το...
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κλασικό από τις καλύτερες και ανατρεπτικότερες...
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,μου θυμιζει τα κόμικ του αστερίξ που όσο μεγάλ...
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,το κάτι άλλο


In [None]:
# import nltk library and download stop words
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
print(stopwords.words('greek'))

['αλλα', 'αν', 'αντι', 'απο', 'αυτα', 'αυτεσ', 'αυτη', 'αυτο', 'αυτοι', 'αυτοσ', 'αυτουσ', 'αυτων', 'αἱ', 'αἳ', 'αἵ', 'αὐτόσ', 'αὐτὸς', 'αὖ', 'γάρ', 'γα', 'γα^', 'γε', 'για', 'γοῦν', 'γὰρ', "δ'", 'δέ', 'δή', 'δαί', 'δαίσ', 'δαὶ', 'δαὶς', 'δε', 'δεν', "δι'", 'διά', 'διὰ', 'δὲ', 'δὴ', 'δ’', 'εαν', 'ειμαι', 'ειμαστε', 'ειναι', 'εισαι', 'ειστε', 'εκεινα', 'εκεινεσ', 'εκεινη', 'εκεινο', 'εκεινοι', 'εκεινοσ', 'εκεινουσ', 'εκεινων', 'ενω', 'επ', 'επι', 'εἰ', 'εἰμί', 'εἰμὶ', 'εἰς', 'εἰσ', 'εἴ', 'εἴμι', 'εἴτε', 'η', 'θα', 'ισωσ', 'κ', 'καί', 'καίτοι', 'καθ', 'και', 'κατ', 'κατά', 'κατα', 'κατὰ', 'καὶ', 'κι', 'κἀν', 'κἂν', 'μέν', 'μή', 'μήτε', 'μα', 'με', 'μεθ', 'μετ', 'μετά', 'μετα', 'μετὰ', 'μη', 'μην', 'μἐν', 'μὲν', 'μὴ', 'μὴν', 'να', 'ο', 'οι', 'ομωσ', 'οπωσ', 'οσο', 'οτι', 'οἱ', 'οἳ', 'οἷς', 'οὐ', 'οὐδ', 'οὐδέ', 'οὐδείσ', 'οὐδεὶς', 'οὐδὲ', 'οὐδὲν', 'οὐκ', 'οὐχ', 'οὐχὶ', 'οὓς', 'οὔτε', 'οὕτω', 'οὕτως', 'οὕτωσ', 'οὖν', 'οὗ', 'οὗτος', 'οὗτοσ', 'παρ', 'παρά', 'παρα', 'παρὰ', 'περί', 'περὶ', 'πο

In [None]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [None]:
# import a tokenizer which is suitable for Greek language
from nltk.tokenize.toktok import ToktokTokenizer

In [None]:
tokenizer=ToktokTokenizer()

In [None]:
stopword_list=nltk.corpus.stopwords.words('greek')

In [None]:
#remove stop words
def remove_stopwords(text):
  tokens = tokenizer.tokenize(text)
  tokens = [token.strip() for token in tokens]
  filtered_tokens = [token for token in tokens if token not in stopword_list]
  filtered_text = ' '.join(filtered_tokens)
  return filtered_text

In [None]:
df['review_lower'] = df['review_lower'].apply(remove_stopwords)

In [None]:
df.head()

Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url,review_lower
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,φοβερή σύλληψη χιούμορ από τους pythons ιδανικ...
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,από καλύτερς στιγμές μπ ισάξιο μόνο monty pyth...
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κλασικό από καλύτερες ανατρεπτικότερες κωμωδίε...
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,μου θυμιζει κόμικ αστερίξ όσο μεγάλωνα ανακάλυ...
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κάτι άλλο
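
The output above still contains "από", even though the NLTK list includes the unaccented form 'απο'. The list stores stop words without accents and with 'σ' in place of final 'ς', so accented tokens do not match it. A minimal sketch of one possible fix follows, assuming tokens are normalized only for the lookup and the original spelling is kept in the text. The names `normalize_greek`, `remove_stopwords_normalized`, and the small `stopword_set` are hypothetical and stand in for the full NLTK list.

```python
import unicodedata

def normalize_greek(token: str) -> str:
    """Strip diacritics and map final sigma to medial sigma (lookup form only)."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("ς", "σ")

# Hypothetical mini stop list written in the same style as the NLTK output above.
stopword_set = {"απο", "και", "ειναι", "αυτοσ"}

def remove_stopwords_normalized(text: str) -> str:
    """Drop tokens whose normalized form is a stop word; keep the rest unchanged."""
    tokens = text.split()
    kept = [t for t in tokens if normalize_greek(t) not in stopword_set]
    return " ".join(kept)

cleaned = remove_stopwords_normalized("από καλύτερες στιγμές και είναι αυτός")
print(cleaned)  # καλύτερες στιγμές
```

Using a set instead of a list also makes each membership test constant-time, which may matter when the function is applied to every review.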


In [None]:
# drop neutral reviews (label equal to 3); copy so later column assignments do not raise SettingWithCopyWarning
df = df[df['label'] != 3].copy()

In [None]:
# map labels of 3.5 or higher to 1 (positive) and all other labels to 0 (negative)
def sentiment(n):
  return 1 if n >= 3.5 else 0
df['sentiment'] = df['label'].apply(sentiment)

In [None]:
df.head()

Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url,review_lower,sentiment
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,φοβερή σύλληψη χιούμορ από τους pythons ιδανικ...,1
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,από καλύτερς στιγμές μπ ισάξιο μόνο monty pyth...,1
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κλασικό από καλύτερες ανατρεπτικότερες κωμωδίε...,1
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,μου θυμιζει κόμικ αστερίξ όσο μεγάλωνα ανακάλυ...,1
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κάτι άλλο,1


In [None]:
# count values of 0 and 1
df.sentiment.value_counts()

1    78085
0    53023
Name: sentiment, dtype: int64

In [None]:
#download the greek pipeline for lemmatizer
! python -m spacy download el_core_news_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting el-core-news-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/el_core_news_md-3.3.0/el_core_news_md-3.3.0-py3-none-any.whl (42.9 MB)
     |████████████████████████████████| 42.9 MB 1.1 MB/s 
Installing collected packages: el-core-news-md
Successfully installed el-core-news-md-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('el_core_news_md')


In [None]:

#import required libraries
import spacy
import el_core_news_md

In [None]:
nlp = el_core_news_md.load()

In [None]:
# text lemmatization: transform each word to its lemma, which reduces the number of unique words
# (the '-PRON-' check only matters for spaCy v2 models; v3 models return real lemmas)
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])

In [None]:
df['review_lower'] = df['review_lower'].apply(lemmatize_text)

In [None]:
df.head()

Unnamed: 0,id number,greek title,original title,category,director,movie lenght,movie date,author,review date,review,stars,label,mean of stars,number of reviews,url,review_lower,sentiment
0,0,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Marvin,2002,Φοβερή η σύλληψη του χιούμορ από τους Python''...,4 αστεράκΑ,4.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,φοβερός σύλληψη χιούμορ από ο pythons ιδανικός...,1
1,1,Και Τώρα Κάτι Τελείως Διαφορετικό,AND NOW FOR SOMETHING COMPLETELY DIFFERENT,Κωμωδία,Ίαν Μακ Νότον,88,1971,Χριστόφορος Ζώνας,2002,"Από τις καλύτερς στιγμές των Μ.Π., ισάξιο μόνο...",5 αστεράκΑ,5.0,4.5,2,https://www.athinorama.gr/cinema/movieratings....,από καλύτερςς στιγμή μπ ισάξιο μόνο monty pyth...,1
2,2,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Vandim,2020,Κλασικό! Από τις καλύτερες και ανατρεπτικότερε...,4 αστεράκΑ,4.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κλασικός από καλύτερες ανατρεπτικός κωμωδία όλ...,1
3,3,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,dH,2015,Μου θυμιζει τα κόμικ του Αστερίξ που όσο μεγάλ...,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,μου θυμιζω κόμικ αστερίξ όσο μεγάλωνα ανακάλυπ...,1
4,4,Το Αδελφάτο των Ιπποτών της Ελεεινής Τραπέζης,Monty Python and the Holy Grail,Κωμωδία,"Τέρι Γκίλιαμ, Τέρι Τζόουνς",91,1975,Orestis,2014,Το κάτι άλλο!,5 αστεράκΑ,5.0,4.0,32,https://www.athinorama.gr/cinema/movieratings....,κάτι άλλος,1


# **Text Vectorization**
Machine learning algorithms cannot work with raw words, so the text has to be transformed into numeric vectors. I used two vectorizers. CountVectorizer simply counts how many times each word (or n-gram) appears in a document. TfidfVectorizer also takes into account how frequent a term is across the whole corpus, so terms that appear in many documents receive a lower weight.
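
To make the difference between the two vectorizers concrete, here is a small sketch on a made-up three-document corpus (the documents are illustrative and not taken from the dataset). It shows that CountVectorizer stores raw counts, while TfidfVectorizer gives a lower weight to a word that occurs in several documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus, for illustration only.
docs = ["καλή ταινία", "κακή ταινία", "καλή καλή μουσική"]

# Raw counts: each cell is the number of times a word occurs in a document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
kali_count = counts[2, count_vec.vocabulary_["καλή"]]
print(kali_count)  # 2, because "καλή" appears twice in the third document

# TF-IDF: counts are reweighted by how rare each word is across the corpus.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)
second_doc = weights[1].toarray().ravel()
kaki_weight = second_doc[tfidf_vec.vocabulary_["κακή"]]
tainia_weight = second_doc[tfidf_vec.vocabulary_["ταινία"]]

# "κακή" occurs in one document and "ταινία" in two, so within the second
# document the rarer word "κακή" receives the larger weight.
print(kaki_weight > tainia_weight)
```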

# **Training the model**
I tried three algorithms to see which of them gives the best results. Logistic regression is a supervised classification algorithm that models the probability of a class using the sigmoid function. A support vector machine (SVM) is a supervised algorithm, usable for both classification and regression, that finds a hyperplane in an N-dimensional space that separates the data points of the two classes. Random forest is also a supervised algorithm used for classification and regression problems; it builds decision trees on different samples of the data and takes their majority vote for classification or their average for regression.
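
As a rough illustration of how logistic regression uses the sigmoid function, the sketch below turns a linear score into a probability and then into a class. The weights, bias, and term counts are made-up values, not parameters of the model trained in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for a three-term vocabulary and a bias; illustration only.
weights = np.array([1.2, -0.8, 0.1])
bias = -0.2

# Term counts of one made-up review.
x = np.array([2, 0, 1])

score = weights @ x + bias            # linear score w·x + b
probability = sigmoid(score)          # estimated probability of the positive class
predicted = int(probability >= 0.5)   # choose the more probable class

print(round(float(probability), 4), predicted)
```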

In [None]:
#import needed modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [None]:
#define X and Y
X =df['review_lower']
Y = df['sentiment']

In [None]:
#split the data into a train and a test part
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)

In [None]:
# vectorize the data with unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)


In [None]:
#Logistic regression
lrc = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
lrc.fit(ctmTr, Y_train)

LogisticRegression(C=1, max_iter=500, random_state=42)

In [None]:
#Accuracy score
lrc_score= lrc.score(X_test_dtm, Y_test)
print('Results for Logistic Regression with CountVectorizer')
print (lrc_score)

Results for Logistic Regression with CountVectorizer
0.852701589529243


In [None]:
#import required library
from sklearn.metrics import confusion_matrix

In [None]:
# predicting the labels for test data
Y_pred_lrc = lrc.predict(X_test_dtm)

In [None]:
#Confusion matrix
cm_lrc = confusion_matrix(Y_test, Y_pred_lrc)

In [None]:
tn , fp, fn, tp = confusion_matrix(Y_test, Y_pred_lrc).ravel()
print(tn, fp, fn, tp)

10473 2945 1883 17476


In [None]:
#True positive and true negative rates
tpr_lrc = round(tp/(tp + fn), 4)
tnr_lrc = round(tn/(tn+fp), 4)

In [None]:
print(tpr_lrc, tnr_lrc)

0.9027 0.7805
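
The same confusion-matrix steps are repeated for every model in this notebook, so they could be wrapped in one helper. The sketch below is one way to do that; `binary_rates` is a hypothetical name, and the toy labels are made up for the demonstration.

```python
from sklearn.metrics import confusion_matrix

def binary_rates(y_true, y_pred):
    """Return accuracy, true positive rate and true negative rate for 0/1 labels."""
    # labels=[0, 1] fixes the cell order to tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": tp / (tp + fn),   # share of positive reviews predicted as positive
        "tnr": tn / (tn + fp),   # share of negative reviews predicted as negative
    }

# Made-up labels: 2 true positives, 1 true negative, 1 false positive, 1 false negative.
rates = binary_rates([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(rates)
```

With the notebook's variables, a call such as `binary_rates(Y_test, Y_pred_lrc)` would replace the separate confusion-matrix and rate cells.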


In [None]:
# vectorize the data with unigrams, bigrams and trigrams
cv = CountVectorizer(ngram_range=(1, 3))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

In [None]:
#Logistic regression
lrc = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
lrc.fit(ctmTr, Y_train)

LogisticRegression(C=1, max_iter=500, random_state=42)

In [None]:
#Accuracy score
lrc_score= lrc.score(X_test_dtm, Y_test)
print('Results for Logistic Regression with CountVectorizer')
print (lrc_score)

Results for Logistic Regression with CountVectorizer
0.852610061933673


In [None]:
# predicting the labels for test data
Y_pred_lrc = lrc.predict(X_test_dtm)

In [None]:
#Confusion matrix
cm_lrc = confusion_matrix(Y_test, Y_pred_lrc)

In [None]:
tn , fp, fn, tp = confusion_matrix(Y_test, Y_pred_lrc).ravel()
print(tn, fp, fn, tp)

10426 2992 1839 17520


In [None]:
#True positive and true negative rates
tpr_lrc = round(tp/(tp + fn), 4)
tnr_lrc = round(tn/(tn+fp), 4)

In [None]:
print(tpr_lrc, tnr_lrc)

0.905 0.777


In [None]:
#Support Vector Machine
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)


In [None]:
# vectorize the text data with unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

In [None]:
from sklearn import svm

In [None]:
#Training the model
svclc =  svm.SVC()
svclc.fit(ctmTr, Y_train)

SVC()

In [None]:
#Accuracy score
svclc_score = svclc.score(X_test_dtm, Y_test)
print ('Results for Support Vector Machine with CountVectorizer')
print(svclc_score)

Results for Support Vector Machine with CountVectorizer
0.8297891814382036


In [None]:
#Predicting the labels for test data
Y_pred_svc = svclc.predict(X_test_dtm)

In [None]:
#confusion matrix
cm_svc = confusion_matrix(Y_test, Y_pred_svc)

In [None]:
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_svc).ravel()
print(tn, fp, fn, tp)

9521 3897 1682 17677


In [None]:
#True positive and true negative rates
tpr_svc = round(tp/(tp+fn), 4)
tnr_svc = round(tn/(tn+fp), 4)
print(tpr_svc, tnr_svc)

0.9131 0.7096


In [None]:
# vectorize the text data with unigrams, bigrams and trigrams
cv = CountVectorizer(ngram_range=(1, 3))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

In [None]:
#Training the model
svclc =  svm.SVC()
svclc.fit(ctmTr, Y_train)

In [None]:
#random forest
from sklearn.ensemble import RandomForestClassifier

In [None]:
#split data into train and test part
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)

In [None]:
# vectorize the data with unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

In [None]:
#Training the model
forestc = RandomForestClassifier(n_estimators=50)
forestc.fit(ctmTr, Y_train)



RandomForestClassifier(n_estimators=50)

In [None]:
#Accuracy score
forestc_score = forestc.score(X_test_dtm, Y_test)
print("Results for Random Forest Classifier with CountVectorizer")
print(forestc_score)

Results for Random Forest Classifier with CountVectorizer
0.8105988955670135


In [None]:
#predicting the labels for test data
Y_pred_forestc= forestc.predict(X_test_dtm)

In [None]:
#confusion matrix
cm_forestc = confusion_matrix(Y_test, Y_pred_forestc)


In [None]:
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_forestc).ravel()
print(tn, fp, fn, tp)

8645 4773 1435 17924


In [None]:
#true positive and true negative rates
tpr_forestc = round(tp/(tp + fn), 4)
tnr_forestc = round(tn/(tn+fp), 4)
print(tpr_forestc, tnr_forestc)

0.9259 0.6443


In [None]:
# vectorize the data with unigrams, bigrams and trigrams
cv = CountVectorizer(ngram_range=(1, 3))
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

In [None]:
#Training the model
forestc = RandomForestClassifier(n_estimators=50)
forestc.fit(ctmTr, Y_train)

KeyboardInterrupt: ignored

In [None]:
#Tfidf Vectorizer
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)


In [None]:
#import the required vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
#vectorizing text data
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [None]:
#training the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr.fit(X_train_vec, Y_train)

LogisticRegression(C=1, max_iter=500, random_state=42)

In [None]:
#Accuracy score
lr_score = lr.score(X_test_vec, Y_test)
print('Results for logistic Regression with tfidf')
print(lr_score)

Results for logistic Regression with tfidf
0.840833511303658


In [None]:
#predicting the labels for test data
Y_pred_lr = lr.predict(X_test_vec)

In [None]:
#confusion matrix
from sklearn.metrics import confusion_matrix
cm_lr = confusion_matrix(Y_test, Y_pred_lr)

In [None]:
#True positive and true negative rates
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_lr).ravel()
print(tn, fp, fn, tp)
tpr_lr = round(tp/(tp+fn), 4)
tnr_lr= round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)

10265 3153 2064 17295
0.8934 0.765


In [None]:
#support Vector Machine
#split the data into train and test part
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)


In [None]:
#vectorizing the data
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [None]:
#import the required algorithm
from sklearn import svm

In [None]:
#train the model and get the accuracy score
#params = {'kernel': ('linear', 'rbf'), 'C':[1, 10, 100]}
svcl = svm.SVC(kernel='rbf')
#clf_sv = GridSearchCV(svcl, params)
svcl.fit(X_train_vec, Y_train)
svcl_score = svcl.score(X_test_vec, Y_test)
print('Result for Support Vector Machine with tfidf')
print(svcl_score)

Result for Support Vector Machine with tfidf
0.8478201177655063
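
The commented-out lines in the cell above hint at a grid search over the kernel and `C`. The sketch below shows how such a search could look with `GridSearchCV` and a `Pipeline`, using a made-up eight-review corpus so that it runs quickly. The toy texts, the step names `tfidf` and `svc`, and `cv=2` are assumptions for illustration; on the full dataset a kernel SVM grid search may take a long time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up toy reviews (4 positive, 4 negative), for illustration only.
texts = [
    "υπέροχη ταινία με καλή μουσική",
    "καλή ταινία και υπέροχη ερμηνεία",
    "πολύ καλή ιστορία",
    "υπέροχη ερμηνεία και καλή ιστορία",
    "βαρετή ταινία με κακή μουσική",
    "κακή ταινία και βαρετή ερμηνεία",
    "πολύ κακή ιστορία",
    "βαρετή ερμηνεία και κακή ιστορία",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no information leaks from the validation fold.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Same candidate values as the commented-out params dictionary above.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```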


In [None]:
#Predicting the labels for test data
Y_pred_sv = svcl.predict(X_test_vec)

In [None]:
#confusion matrix
cm_sv = confusion_matrix(Y_test, Y_pred_sv)

In [None]:
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_sv).ravel()
print(tn, fp, fn, tp)

10335 3083 1905 17454


In [None]:
#True positive and true negative rates
tpr_sv = round(tp/(tp+fn), 4)
tnr_sv = round(tn/(tn+fp), 4)

In [None]:
print(tpr_sv, tnr_sv)

0.9016 0.7702


In [None]:
#random forest classifier
#split the data into train and test part
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.25, random_state=0)

In [None]:
#vectorizing the data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [None]:
#Training the model
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train_vec, Y_train)

RandomForestClassifier(n_estimators=50)

In [None]:
#Accuracy score
forest_score =forest.score(X_test_vec, Y_test)
print('Results for Random Forest Classifier with tfidf')
print(forest_score)

Results for Random Forest Classifier with tfidf
0.8075784849132013


In [None]:
#predicting the labels for test data
Y_pred_forest = forest.predict(X_test_vec)

In [None]:
#confusion matrix
cm_forest= confusion_matrix(Y_test, Y_pred_forest)

In [None]:
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_forest).ravel()
print(tn, fp, fn, tp)

9053 4365 1942 17417


In [None]:
#true positive and true negative rates
tpr_forest = round(tp/(tp+fn), 4)
tnr_forest = round(tn/(tn+fp), 4)
print(tpr_forest, tnr_forest)

0.8997 0.6747


In [None]:
import pickle

In [None]:
with open('svcl', 'wb') as f:
    pickle.dump(svcl, f)

In [None]:
with open('lrc', 'wb') as f:
    pickle.dump(lrc, f)
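
The cells above save only the classifiers, so the fitted vectorizer would also be needed to classify new text later. One option, sketched below under the assumption that the vectorizer and classifier are combined in a scikit-learn `Pipeline`, is to pickle the two together as a single object. The toy texts and the file name `sentiment_pipeline.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data, for illustration only.
texts = ["καλή ταινία", "υπέροχη ταινία", "κακή ταινία", "βαρετή ταινία"]
labels = [1, 1, 0, 0]

# The pipeline keeps the fitted vocabulary and the classifier together.
model = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", LogisticRegression(max_iter=500)),
])
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw text directly.
print(restored.predict(["καλή ταινία"]))
```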

### **Results**/**Limitations**/**Suggestions**


*   The best accuracy in the runs above came from logistic regression with CountVectorizer using unigrams and bigrams (about 85.3%), followed by the support vector machine with the TF-IDF vectorizer (about 84.8%); logistic regression with the TF-IDF vectorizer reached about 84.1%
*   The preprocessing of the dataset could be improved with the use of a spell checker
*   A hybrid method that combines a lexicon with the machine learning algorithms could improve the results
*   Bigger datasets and different algorithms could be tried for further improvement


