# Sentiment Analysis Prediction

## Table of Contents:
* [0. Problem Description](#Problem)
* [1. Exploratory Data Analysis (EDA)](#EDA)
* [2. Feature Engineering](#Feat)
  * [2.1. Correlation](#Corr)
  * [2.2. NZV](#NZV)
  * [2.3. Split Train & Test](#Split)
  * [2.4. RFE](#RFE)
  * [2.5. Standardization & PCA](#PCA)
* [3. Predictions](#Pred)
  * [3.1. Average Sentiment for iPhone](#avg)

# <a class="anchor" id="Problem"> 0. Problem Description </a>

The idea behind this project is to scrape the web for pages containing previously chosen words that express sentiment about a certain phone, so that we can answer questions of the type ‘brand X is currently preferred among people’.

Common Crawl, a non-profit organization, crawls billions of web pages roughly once a month and publishes the resulting archive. I used this service to download last month’s data (only the pages that matched our conditions).
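As a rough illustration of the kind of feature such a matching step can produce, the sketch below counts positive and negative words that appear near a phone keyword in a page's text. The word lists, the `window` size, and the `count_sentiment_terms` helper are assumptions made for illustration; the actual extraction rules behind the datasets are not shown in this notebook.

```python
import re

# Hypothetical word lists; the real ones are not given in this notebook.
POSITIVE = {"great", "love", "amazing"}
NEGATIVE = {"bad", "terrible", "hate"}

def count_sentiment_terms(text, keyword, window=5):
    """Count positive/negative words within `window` tokens of `keyword`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = {"mentions": 0, "pos": 0, "neg": 0}
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        counts["mentions"] += 1
        nearby = tokens[max(0, i - window): i + window + 1]
        counts["pos"] += sum(t in POSITIVE for t in nearby)
        counts["neg"] += sum(t in NEGATIVE for t in nearby)
    return counts

page = "I love my iphone, the camera is amazing. The old iphone battery was terrible."
features = count_sentiment_terms(page, "iphone")
print(features)
```

Each page would then contribute one row of counts per keyword, which is roughly the shape of the matrices loaded below.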

In [1]:
# Libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import math
import random
import plotly.express as px

# PCA
from sklearn.decomposition import PCA
from sklearn import preprocessing

# RFE & MODELLING
from sklearn.preprocessing import MinMaxScaler # Min-max scaling to [0, 1]
from sklearn.feature_selection import VarianceThreshold #NZV
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

#Cross Validation
from sklearn.model_selection import train_test_split
from math import sqrt

In [2]:
# Load Datasets
import os

random.seed(203)  # seeds Python's `random` module only
os.chdir('/home/s/Desktop/Dropbox/Documents/Python/Mod4 Ubiqum/Datasets')

galaxy = pd.read_csv('galaxy_smallmatrix_labeled_9d.csv')
iphone = pd.read_csv('iphone_smallmatrix_labeled_8d.csv')
iphone_test = pd.read_csv('concatenated_factors.csv')

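# Collapse the sentiment scale: merge class 0 into 1 and class 5 into 4
iphone["iphonesentiment"] = iphone["iphonesentiment"].replace(0,1)
iphone["iphonesentiment"] = iphone["iphonesentiment"].replace(5,4)

# The id column is not a feature
del iphone_test["id"]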
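The relabelling above can be checked on a small example. This is a minimal sketch on made-up labels, assuming the original sentiment scale runs from 0 to 5; it only shows that the two extreme classes end up merged into their neighbours.

```python
import pandas as pd

# Toy labels on the assumed 0-5 scale (illustrative values only).
labels = pd.Series([0, 1, 2, 3, 4, 5, 5, 0])

# Merge the extreme classes into their neighbours in a single call.
collapsed = labels.replace({0: 1, 5: 4})

print(collapsed.value_counts().sort_index())
```

Passing a dictionary to `replace` performs both substitutions at once, which is equivalent here to the two separate calls in the cell above.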

# <a class="anchor" id="Feat"> 2. Feature Engineering </a>

## <a class="anchor" id="Corr"> 2.1. Correlation </a>

In [3]:
# Find feature columns that are collinear with each other (correlation above thresh)
def find_correlation(df, thresh=0.9):
    corrMatrix = df.corr()
    # Keep only the strict lower triangle so each pair is considered once
    corrMatrix.loc[:,:] = np.tril(corrMatrix, k=-1)
    already_in = set()
    result = []
    for col in corrMatrix:
        perfect_corr = corrMatrix[col][corrMatrix[col] > thresh].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)
    # Keep the first column of each correlated group and flag the rest for removal
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat

collinear = find_correlation(iphone, thresh=0.9)
print(collinear)

#Dropping the mutually collinear columns from both datasets (from 59 to 44!)
iphone_test = iphone_test.drop(columns=collinear)
iphone = iphone.drop(columns=collinear)
print(iphone.shape, iphone_test.shape)

['iphone', 'htcphone', 'nokiacamunc', 'nokiadisneg', 'nokiacampos', 'samsungdispos', 'nokiadispos', 'samsungdisneg', 'nokiadisunc', 'nokiaperunc', 'nokiaperpos', 'iosperunc', 'iosperpos', 'googleperpos']
(12973, 45) (53485, 44)
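To make the effect of a correlation filter easier to see, here is a simplified variant run on toy data. It is not the function used above: `correlated_columns` is a hypothetical helper that flags every column correlated above the threshold with an earlier column, and the data are synthetic.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, thresh=0.9):
    """Return columns whose correlation with an earlier column exceeds thresh."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > thresh).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent of "a"
})

to_drop = correlated_columns(toy, thresh=0.9)
print(to_drop)
```

Like the notebook's function, this sketch only looks at positive correlations; comparing `corr.abs()` against the threshold would also catch strongly negative ones.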


## <a class="anchor" id="NZV"> 2.2. NZV </a>
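A near-zero-variance filter could be sketched with the `VarianceThreshold` class imported earlier. The threshold of 0.2 and the toy columns below are assumptions chosen for illustration, not values taken from this analysis.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "varied": [0, 3, 1, 4, 2, 5],
    "almost_constant": [0, 0, 0, 0, 0, 1],
    "constant": [1, 1, 1, 1, 1, 1],
})

# Hypothetical cut-off: drop columns whose variance does not exceed 0.2.
selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

On count-like features such as the ones in this dataset, a suitable threshold would depend on the scale of each column, so it would need to be tuned rather than copied from this sketch.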

## <a class="anchor" id="Split"> 2.3. Split Train & Test </a>

In [4]:
#Defining Train & Label
x = iphone.iloc[:,:-1] # Features: every column except the last
y = iphone.iloc[:,-1] # Label: the last column

# All labelled rows are used for training
x_train = x
y_train = y

# The test set has no label column
x_test = iphone_test



In [5]:
x_train.shape, x_test.shape

((12973, 44), (53485, 44))
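Because every labelled row goes into training here, there is no labelled data left to measure performance on. One common alternative is to hold out part of the labelled set with `train_test_split`, which is already imported. The sketch below uses synthetic data, and the 20% split size is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled matrix: 100 rows, 4 features, labels 1-4.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 4))
y = np.repeat([1, 2, 3, 4], 25)

# Hold out 20% of the labelled rows, keeping class proportions equal.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=203
)

print(X_fit.shape, X_val.shape)
```

Stratifying on the label matters when some sentiment classes are rare, since a plain random split could leave them out of the validation part.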

## <a class="anchor" id="RFE"> 2.4. RFE </a>

In [6]:
RF = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
# Remove 3 features per iteration; by default RFE keeps half of the features
rfe = RFE(estimator=RF, step=3)
rfe = rfe.fit(x_train, y_train)
## print summaries for the selection of attributes (Ranking 1 = selected)
selected_rfe = pd.DataFrame({'Feature': list(x_train.columns),'Ranking': rfe.ranking_})

print(selected_rfe.sort_values(by='Ranking'))
# Keep only the selected features in both sets
x_train = rfe.transform(x_train)
x_test = rfe.transform(x_test)

          Feature  Ranking
0   samsunggalaxy        1
41      iosperneg        1
40      htcperunc        1
37   iphoneperunc        1
36      htcperneg        1
32   iphoneperneg        1
31      htcperpos        1
29  samsungperpos        1
28   iphoneperpos        1
27      htcdisunc        1
24   iphonedisunc        1
23      htcdisneg        1
20      htcdispos        1
18   iphonedispos        1
21   iphonedisneg        1
1      sonyxperia        1
13      htccamneg        1
3             ios        1
4   googleandroid        1
5    iphonecampos        1
8       htccampos        1
14   iphonecamunc        1
9    iphonecamneg        2
15  samsungcamunc        3
17      htccamunc        3
6   samsungcampos        3
42   googleperneg        4
25  samsungdisunc        4
30     sonyperpos        4
2     nokialumina        5
7      sonycampos        5
10  samsungcamneg        5
33  samsungperneg        6
38  samsungperunc        6
43   googleperunc        6
34     sonyperneg        7
3
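The ranking above follows from how `RFE` works: features with rank 1 are kept, and higher ranks were eliminated in earlier rounds. A small sketch on synthetic data (the dataset and forest size are assumptions) shows the same pattern with 10 features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data with 10 features (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
# With n_features_to_select left at its default, RFE keeps half the features.
selector = RFE(estimator=forest, step=3).fit(X, y)

print(selector.ranking_)        # 1 marks a selected feature
print(selector.support_.sum())  # number of selected features
```

With 44 input features, the same default explains why 22 features have rank 1 in the output above.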

## <a class="anchor" id="PCA"> 2.5. Standardization & PCA </a>

In [7]:
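# Fit the scaler on the training data only
stand = MinMaxScaler().fit(x_train)
x_train = stand.transform(x_train)
# Keep enough components to explain 99.9% of the variance
pca = PCA(0.999)
pca.fit(x_train)
x_train = pd.DataFrame(pca.transform(x_train))
# Test: reuse the scaler and PCA fitted on the training data
x_test = stand.transform(x_test)
x_test = pd.DataFrame(pca.transform(x_test))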
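The same two steps can be chained in a scikit-learn pipeline, which guarantees that the test data is transformed with parameters learned from the training data. This is a sketch on synthetic data with deliberately redundant columns, not a replacement for the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Six columns, but the last three are exact copies of the first three.
X = np.hstack([base, base])

# A float n_components keeps enough components to explain that variance share.
reducer = make_pipeline(MinMaxScaler(), PCA(n_components=0.999))
X_reduced = reducer.fit_transform(X)

pca_step = reducer.named_steps["pca"]
print(X_reduced.shape, pca_step.explained_variance_ratio_.sum())
```

Since the duplicated columns add no new variance, PCA reduces the six inputs to three components while still meeting the 99.9% target.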

# <a class="anchor" id="Pred"> 3. Predictions </a>

In [8]:
#RANDOM FOREST
RF = RandomForestClassifier(n_estimators=100).fit(x_train, y_train)
#print(cross_val_score(RF, x_train, y_train, cv=10))

#Predictions
RF_pred = RF.predict(x_test)
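Since the test set has no labels, the quality of these predictions can only be estimated on the labelled data, for example with cross-validation. The sketch below uses synthetic data and reports accuracy and Cohen's kappa, both of which are imported at the top of the notebook; the fold count of 5 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic 4-class problem standing in for the labelled sentiment data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# One score per fold; the default scoring for a classifier is accuracy.
acc_scores = cross_val_score(model, X, y, cv=5)
kappa_scores = cross_val_score(
    model, X, y, cv=5, scoring=make_scorer(cohen_kappa_score)
)

print(acc_scores.mean(), kappa_scores.mean())
```

Kappa is worth reporting alongside accuracy when the classes are imbalanced, because it corrects for agreement expected by chance.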

## <a class="anchor" id="avg"> 3.1. Average Sentiment for iPhone </a>

In [9]:
np.mean(RF_pred)

3.111283537440404
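A single mean hides how the predictions are spread across classes. As a complement, the share of each predicted class can be computed with `np.unique`; the predictions below are hypothetical values on the collapsed 1-4 scale, not the output of the model above.

```python
import numpy as np

# Hypothetical predictions on the collapsed 1-4 scale.
preds = np.array([1, 4, 4, 3, 4, 2, 4, 1])

classes, counts = np.unique(preds, return_counts=True)
shares = dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

print(preds.mean())
print(shares)
```

Applied to `RF_pred`, the same two lines would show whether an average near 3 comes from mostly neutral pages or from a mix of very positive and very negative ones.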