# Capstone Project Domain #11 (Sentiment Analysis in Twitter)

Tweet text, along with other features, has been extracted from different sources (domains) using APIs.
Each row of the dataset contains a sentiment code (negative, positive, or neutral) embedded in the Twit-id column. The task is to predict whether a tweet contains positive, negative, or neutral sentiment. This is a supervised learning task: given a text string, the model must assign one of these three classes.

## Step 3 - Basic Classification Test using Random Forest Classifier

#### In this file we test the data with a basic random forest classifier
### Flow

1. Read the output file from Step 2 as the input for Step 3
2. Check for missing values
3. Drop NULL tweet rows
4. Arrange the columns required for text processing
5. Set the class column as the category for classification
6. Set the features and labels arrays from the data frame
7. Vectorize using TF-IDF, converting the text to numbers so that machine learning can be applied
8. Report model performance metrics
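
The vectorize-and-classify part of the flow above can be condensed into a single scikit-learn `Pipeline`. The following is a minimal sketch on a tiny made-up corpus, not the notebook's data; the toy texts, the integer labels and the reduced `n_estimators` are assumptions chosen so the example runs quickly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the real notebook reads cleaned tweets from a CSV file.
texts = [
    "love this phone", "great day today", "really happy now",
    "hate this traffic", "awful service again", "so sad today",
    "flight lands at noon", "meeting moved to monday", "store opens at nine",
]
toy_labels = [1, 1, 1, 0, 0, 0, 2, 2, 2]  # arbitrary integer codes for three classes

# Steps 7 and 8 of the flow: vectorize the text, then classify it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(texts, toy_labels)

predictions = pipeline.predict(["happy with this phone", "sad about the traffic"])
print(predictions)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the data passed to `fit`, which helps when the model is later evaluated on held-out text.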

### Input File - Step2_PreProcessing_Group33_Cleaned_Tweets.csv

In [1]:
# Library Imports

import numpy as np 
print('numpy: {}'.format(np.__version__))

import pandas as pd
print('pandas: {}'.format(pd.__version__))

import re
print('re: {}'.format(re.__version__))

import nltk
print('nltk: {}'.format(nltk.__version__))

import matplotlib.pyplot as plt

%matplotlib inline

numpy: 1.18.5
pandas: 1.0.5
re: 2.2.1
nltk: 3.5


### Data Input / Output - Folders where the input data will be read and output will be stored.

In [2]:
InputdataFolder = "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project"
OutputFolder = "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project/Output"
MLOutfolder =  "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project/ML"
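
The folder strings above are joined to file names with `+` later on. As an alternative, paths can be built with `pathlib`; this is only a sketch, and `base_dir` below is a hypothetical placeholder rather than the project's real location.

```python
from pathlib import Path

# Hypothetical base directory; replace with the real project location.
base_dir = Path("capstone_project")

input_folder = base_dir
output_folder = base_dir / "Output"
ml_folder = base_dir / "ML"

# The "/" operator inserts the correct separator for the operating system.
cleaned_tweets_path = output_folder / "Step2_PreProcessing_Group33_Cleaned_Tweets.csv"
print(cleaned_tweets_path)
```

`pd.read_csv` accepts a `Path` object directly, so no string conversion is needed.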

## Reading the pre-processed data from Step 2

In [3]:
# Read the pre-processed data written by Step 2
# The input file is located in OutputFolder
cleaned_tweets_df = pd.read_csv(OutputFolder+"/Step2_PreProcessing_Group33_Cleaned_Tweets.csv")
print(cleaned_tweets_df.shape)

(30155, 11)


### 1. Finding missing values

In [33]:
# Function to find the missing values in each column

def find_missing_values_func(df):
    """Return a table of the columns that have missing values, with counts and percentages."""
    # Count and percentage of missing values per column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

In [34]:
# Invoking find_missing_values_func() with the data frame of cleaned tweets

columnsWiseMissingValue = find_missing_values_func(cleaned_tweets_df) 
print(columnsWiseMissingValue)

Selected dataframe has 3 columns.
There are 0 columns that have missing values.
Empty DataFrame
Columns: [Missing Values, % of Total Values]
Index: []
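
Because the summary above is empty, it does not show what the report looks like when values are missing. The sketch below runs a compact equivalent of the same calculation on a hypothetical four-row frame; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing tweet; not the project data.
toy_df = pd.DataFrame({
    "tweet_id": ["a-1", "a-2", "a-3", "a-4"],
    "Clean_tweet": ["good day", np.nan, "bad day", "ok"],
    "class": [1, 2, 0, 2],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "Missing Values": toy_df.isnull().sum(),
    "% of Total Values": (100 * toy_df.isnull().mean()).round(1),
})

# Keep only the columns that actually have gaps.
summary = summary[summary["Missing Values"] > 0]
print(summary)
```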


### 2. Dropping NULL tweet rows

In [36]:
# Drop NULL Tweet-Text  rows as we use tweet text for text processing 
cleaned_tweets_df = cleaned_tweets_df.dropna(subset=["Clean_tweet"])

In [37]:
# Check for missing values again
columnsWiseMissingValue = find_missing_values_func(cleaned_tweets_df) 
print(columnsWiseMissingValue)

Selected dataframe has 3 columns.
There are 0 columns that have missing values.
Empty DataFrame
Columns: [Missing Values, % of Total Values]
Index: []
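
Note that `dropna` removes only true `NaN` values. A tweet reduced to a whitespace-only string during cleaning would survive it. The sketch below, on hypothetical data, shows one way to catch both cases; whether the project data contains such rows is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN tweet and one whitespace-only tweet.
toy_df = pd.DataFrame({
    "tweet_id": ["a-1", "a-2", "a-3", "a-4"],
    "Clean_tweet": ["good day", np.nan, "   ", "bad day"],
})

# Plain dropna removes only the NaN row.
only_nan_dropped = toy_df.dropna(subset=["Clean_tweet"])

# Also drop rows whose text is empty after stripping whitespace.
stripped = toy_df["Clean_tweet"].str.strip()
fully_cleaned = toy_df[stripped.notna() & (stripped != "")]

print(len(only_nan_dropped), len(fully_cleaned))
```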


In [38]:
cleaned_tweets_df.dtypes

tweet_id         object
Clean_tweet      object
class          category
dtype: object

## Getting Data Ready for Text Processing

### 1. Columns required for Text Processing

In [39]:
# For text processing, arrange the columns as tweet_id, Clean_tweet and class
ArrangeCollist = ['tweet_id',
                  'Clean_tweet',
                  'class']  # Label


cleaned_tweets_df = cleaned_tweets_df.reindex(columns=ArrangeCollist)

In [40]:
# Check for missing values again
columnsWiseMissingValue = find_missing_values_func(cleaned_tweets_df) 
print(columnsWiseMissingValue)

Selected dataframe has 3 columns.
There are 0 columns that have missing values.
Empty DataFrame
Columns: [Missing Values, % of Total Values]
Index: []


### 2. Setting the class column as the category for classification

In [42]:
cleaned_tweets_df["class"] = cleaned_tweets_df["class"].astype('category')

In [43]:
cleaned_tweets_df.dtypes

tweet_id         object
Clean_tweet      object
class          category
dtype: object
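
A categorical column stores each distinct label once and maps every row to an integer code. The short sketch below, using hypothetical values, shows how that mapping and the class distribution can be inspected.

```python
import pandas as pd

# Hypothetical class values; the real column comes from the cleaned tweets.
toy_class = pd.Series([2, 0, 1, 2, 0]).astype("category")

categories = toy_class.cat.categories.tolist()  # distinct labels, in sorted order
codes = toy_class.cat.codes.tolist()            # per-row integer codes
counts = toy_class.value_counts().sort_index().to_dict()  # rows per class

print(categories)
print(codes)
print(counts)
```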

In [44]:
cleaned_tweets_df.head(5)

| | tweet_id | Clean_tweet | class |
|---|---|---|---|
| 0 | neu-GG-Tweet-11945 | just land my ear hurt | 2 |
| 1 | neu-GG-Tweet-11944 | ouch follow asot tweetdeck exceed tweet limit | 2 |
| 2 | neu-GG-Tweet-11943 | realli want to see one would go lmfao | 2 |
| 3 | neu-GG-Tweet-11942 | ahh repli random follow do not how sad haha | 2 |
| 4 | neu-GG-Tweet-11941 | awwww did not get hero | 2 |


### 3. Setting the features and labels array from the data frame

In [45]:
features = cleaned_tweets_df.iloc[:, 1].values
labels = cleaned_tweets_df.iloc[:, -1].values

In [46]:
features.shape

(30053,)

In [47]:
# Processing the features array again to remove special characters, single characters and extra whitespace
processed_features = []

for sentence in features:
    # Replace all special (non-word) characters with spaces
    processed_feature = re.sub(r'\W', ' ', str(sentence))

    # Remove single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single character at the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)
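
The same cleaning steps can be wrapped in a function so that they are easy to try on individual strings. This is a sketch: the name `clean_text` and the sample string are hypothetical, and the leading-single-character pattern is anchored with `^`.

```python
import re


def clean_text(text):
    """Apply the cleaning steps from the loop above to a single string."""
    text = re.sub(r"\W", " ", str(text))         # special characters -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters between spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    text = re.sub(r"^b\s+", "", text)            # prefixed 'b'
    return text.lower()


sample = clean_text("I luv   u, Twitter!!  #fun")
print(repr(sample))
```

Digits are not removed by these patterns, because `\W` matches only non-word characters. Also, with its default `token_pattern`, `TfidfVectorizer` ignores single-character tokens, so the single-character steps mainly tidy the text rather than change the resulting features.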

### 4. Vectorization using TF-IDF - Converting the text to numbers to apply machine learning

In [26]:
# Library imports for vectorization
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [48]:
# Initializing the Vectorizer with parameters
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8,
                             stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()
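
To make the vectorizer parameters concrete, the sketch below fits a `TfidfVectorizer` on a hypothetical five-document corpus with scaled-down thresholds. An integer `min_df` is a minimum document count, while a float `max_df` is a maximum fraction of documents. The corpus and the threshold values are illustrative assumptions, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "tweet" appears in every document, several words only once.
corpus = [
    "tweet about good weather",
    "tweet about bad weather",
    "tweet about good food",
    "tweet with bad traffic",
    "tweet rare word",
]

# min_df=2: keep terms found in at least 2 documents.
# max_df=0.8: drop terms found in more than 80% of documents.
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
matrix = toy_vectorizer.fit_transform(corpus)  # SciPy sparse matrix

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```

`fit_transform` returns a sparse matrix. The notebook converts it to a dense array with `.toarray()`, which for 30,053 rows and up to 2,500 float64 columns takes roughly 600 MB. `RandomForestClassifier` also accepts the sparse matrix directly, which may be worth considering if memory is tight.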

#### Creating the Training and Test Data Sets

In [49]:
# Split the data 80:20; stratify preserves the class distribution in both splits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    processed_features, labels, test_size=0.2, random_state=0,
    stratify=cleaned_tweets_df['class'])
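
A quick way to confirm that stratification worked is to compare the class proportions in the two splits. The following is a minimal sketch on hypothetical, imbalanced labels; the `toy_` names and the 50/30/20 distribution are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 50 of class 0, 30 of class 1, 20 of class 2.
toy_labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
toy_features = np.arange(len(toy_labels)).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

# With stratify, both splits keep the original class ratio.
train_share = np.bincount(toy_y_train) / len(toy_y_train)
test_share = np.bincount(toy_y_test) / len(toy_y_test)
print(train_share, test_share)
```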

In [50]:
# Initializing the Random Forest Classifier and fitting the model
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)

RandomForestClassifier(n_estimators=200, random_state=0)

In [51]:
# Calculating the Predictions from the classifier
predictions = text_classifier.predict(X_test)

In [52]:
# Printing the Metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve

print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[1412  481   95]
 [ 508 1736  201]
 [ 407  817  354]]
              precision    recall  f1-score   support

           0       0.61      0.71      0.65      1988
           1       0.57      0.71      0.63      2445
           2       0.54      0.22      0.32      1578

    accuracy                           0.58      6011
   macro avg       0.57      0.55      0.54      6011
weighted avg       0.58      0.58      0.56      6011

0.5825985692896357
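
The figures in the classification report follow directly from the confusion matrix, in which rows are true classes and columns are predicted classes. The sketch below recomputes them with NumPy from the matrix printed above. It also adds a majority-class baseline for comparison, which is not part of the original notebook.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [1412, 481, 95],
    [508, 1736, 201],
    [407, 817, 354],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / all rows of that true class
precision = true_positives / cm.sum(axis=0)  # correct / all predictions of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

# Baseline: always predict the most frequent true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(accuracy, 4), round(majority_baseline, 4))
```

Recall for class 2 is 354 / 1578, or about 0.22, so most class 2 tweets are predicted as class 0 or class 1. The overall accuracy of about 0.58 compares with about 0.41 for always predicting class 1, the most frequent class in the test set.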


# ----DONE----