# Capstone Project Domain #11 (Sentiment Analysis in Twitter)

Tweet text, along with other features, has been extracted from different sources (domains) using APIs.
Each row of the dataset contains a sentiment code (negative, positive, or neutral) embedded in the `tweet_id` column. The task is to predict whether a tweet contains positive, negative, or neutral sentiment. This is a supervised learning task: given a text string, the model must assign it one of the three sentiment labels.

## Step 1 - Pre Processing

#### All pre-processing steps are performed in this file.
### Flow

1. Read the data from the input file
2. Find the missing values in each column
3. Create the label column from the tweet ID
4. Drop rows with NULL tweets
5. Drop rows with duplicate tweet IDs
6. Fill NULL values in the Tweet_source and Tweeted-By columns
7. Drop the date column
8. Clean the Tweet_source column so it follows a single format
9. Visualize the data as pie charts
10. Visualize the data as word clouds to see patterns
11. Save the Step 1 pre-processed data to the output file

### Input File - Base_tweets_DataSetV3.xlsx
### Output File - Step1_PreProcessing_Group33_Cleaned_Tweets.csv

In [1]:
# Library Imports

import numpy as np 
print('numpy: {}'.format(np.__version__))

import pandas as pd
print('pandas: {}'.format(pd.__version__))

import re
print('re: {}'.format(re.__version__))

import nltk
print('nltk: {}'.format(nltk.__version__))

import matplotlib.pyplot as plt

%matplotlib inline

numpy: 1.18.5
pandas: 1.1.2
re: 2.2.1
nltk: 3.2.5


In [2]:
# Getting the Stop Words and Other Text Processing Libraries
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from termcolor import colored
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Data Input / Output - Folders where the input data will be read and output will be stored.

In [3]:
InputdataFolder = "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project"
OutputFolder = "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project/Output"
MLOutfolder =  "/Users/aravindv/Wind/BITS Pilani/PGP - AIML/Course/Course 7 - Capstone Project/ML"

In [4]:
# Reading the data file and storing in the dataframe tweets_original_df
tweets_original_df = pd.read_excel(InputdataFolder+"/Base_tweets_DataSetV3.xlsx")
print(tweets_original_df.shape)

FileNotFoundError: ignored
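
The read above fails when the folder in `InputdataFolder` does not exist on the machine running the notebook. A minimal sketch of checking the path before reading, assuming a hypothetical helper named `resolve_input_file` (it is not part of the original notebook):

```python
from pathlib import Path


def resolve_input_file(folder, filename):
    """Return the full path to the input file, or raise a descriptive error.

    Hypothetical helper for illustration only.
    """
    path = Path(folder) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Input file not found: {path}. Check InputdataFolder."
        )
    return path


# Example with a folder that does not exist: the error names the missing path
try:
    resolve_input_file("/no/such/folder", "Base_tweets_DataSetV3.xlsx")
    found = True
except FileNotFoundError as err:
    found = False
    message = str(err)
```

With a check like this, a wrong folder is reported with the full path that was tried, which makes it easier to spot a path that belongs to a different machine.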

In [None]:
# Peek at the data
print(tweets_original_df.head(5))
print("--------------------------------")
print(tweets_original_df.dtypes)

## Pre - Processing Steps

### 1 . Finding the missing values in each column

In [None]:
# Function to find the missing values in each column

def find_missing_values_func(df):
    """Return a table of the columns with missing values (count and percentage)."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only the columns that have missing values, highest percentage first
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

In [None]:
# Invoking the find_missing_values_func() with data frame of original tweets

columnsWiseMissingValue = find_missing_values_func(tweets_original_df) 
print(columnsWiseMissingValue)
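
To make the shape of the missing-value table concrete, here is a small sketch on made-up data. The dataframe `toy_df` and its values are assumptions for illustration, not rows from the real dataset:

```python
import pandas as pd

# Toy data standing in for the tweets dataframe (values are made up)
toy_df = pd.DataFrame({
    "Tweet": ["good", None, "bad", "ok"],
    "Tweet_source": ["S5", None, None, "S1"],
    "Country": ["IN", "US", "UK", "IN"],
})

missing_count = toy_df.isnull().sum()
missing_percent = (100 * missing_count / len(toy_df)).round(1)

summary = pd.DataFrame({
    "Missing Values": missing_count,
    "% of Total Values": missing_percent,
})
# Keep only the columns that actually have missing values
summary = summary[summary["Missing Values"] > 0].sort_values(
    "% of Total Values", ascending=False)
print(summary)
```

Columns without missing values, such as `Country` here, do not appear in the table.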

### 2. Creating the Label Column from the Tweet ID

In [None]:
# The dataset contains a column called tweet_id that carries the label
# Extract label_id from tweet_id: the first 3 characters are the label

tweets_original_df['label_id'] = tweets_original_df['tweet_id'].str[:3]

In [None]:
# List of Tweets with Label Column Added
display(tweets_original_df.head(5))
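
A short sketch of the prefix extraction on made-up IDs, with a check for prefixes outside the three expected labels. The ID values are hypothetical; only the assumption that the first three characters are `pos`, `neg`, or `neu` comes from the notebook:

```python
import pandas as pd

# Hypothetical tweet_id values; the real IDs are assumed to start with the label
toy_df = pd.DataFrame({"tweet_id": ["pos1001", "neg2002", "neu3003"]})

toy_df["label_id"] = toy_df["tweet_id"].str[:3]

# Flag any prefix outside the three expected labels
unexpected = toy_df[~toy_df["label_id"].isin(["pos", "neg", "neu"])]
```

An empty `unexpected` frame suggests every ID follows the expected pattern.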

### 3. Drop rows which contain NULL tweets as Text Processing will be done on the tweet column

In [None]:
# Drop NULL Tweet-Text  rows as we use tweet text for text processing 
tweets_original_df = tweets_original_df.dropna(subset=["Tweet"])

### 4. Drop rows with duplicate tweet_id (to be done)

In [None]:
# keep=False drops every row whose tweet_id appears more than once
tweets_original_df = tweets_original_df.drop_duplicates(subset="tweet_id", keep=False)
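
The `keep` argument of `drop_duplicates` decides whether one copy of a duplicated ID survives. A minimal sketch on made-up rows showing the difference:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "tweet_id": ["pos1", "pos1", "neg2"],
    "Tweet": ["great", "great", "awful"],
})

# keep=False removes every row whose tweet_id appears more than once
dropped_all = toy_df.drop_duplicates(subset="tweet_id", keep=False)

# keep="first" retains one copy of each duplicated tweet_id
kept_first = toy_df.drop_duplicates(subset="tweet_id", keep="first")
```

Which behaviour is appropriate depends on whether repeated IDs are considered unreliable (drop all) or simply redundant (keep one).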

### 5. Filling Null Tweet Source and Tweet By columns

In [None]:
# Impute NULL Tweet_source as OTHER
column = 'Tweet_source'
tweets_original_df[column] = tweets_original_df[column].fillna("OTHER")

# Impute NULL Tweeted-By as Unknown
column = 'Tweeted-By'
tweets_original_df[column] = tweets_original_df[column].fillna("Unknown")
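
The same imputation can be written as a single `fillna` call with a column-to-value mapping. A sketch on made-up data, assuming the same two column names:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Tweet_source": ["S5", None],
    "Tweeted-By": [None, "user_a"],
})

# One fillna call with a column-to-value mapping
filled = toy_df.fillna({"Tweet_source": "OTHER", "Tweeted-By": "Unknown"})
```

Columns that are not named in the mapping are left untouched.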

In [None]:
#Check missing_values again , if any
columnsWiseMissingValue = find_missing_values_func(tweets_original_df) 
print(columnsWiseMissingValue)

### 6. Dropping the Date Column as it is not required for modelling

In [None]:
#Dropping Date Created column from main dataframe as it has no use
tweets_original_df = tweets_original_df.drop(["Date Created"], axis=1)

In [None]:
print(tweets_original_df.shape)

In [None]:
tweets_original_df.columns

In [None]:
# Configuring the Plot Sizes
plot_size = plt.rcParams["figure.figsize"] 
print(plot_size[0]) 
print(plot_size[1])

plot_size[0] = 8
plot_size[1] = 6
plt.rcParams["figure.figsize"] = plot_size 

In [None]:
tweets_original_df["Tweet_source"].value_counts()

### 7. Cleaning  Tweet_source column to follow a single format

In [None]:
# Replace the variants 'S-5' and 's5' with 'S5'
tweets_original_df['Tweet_source'] = tweets_original_df['Tweet_source'].replace(['S-5', 's5'], 'S5')
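
If more spelling variants turn up, a rule-based normalization may be easier to maintain than listing each one. A sketch, assuming that upper-casing and removing hyphens is safe for every source name (this would also change any legitimate value that contains a hyphen):

```python
import pandas as pd

sources = pd.Series(["S-5", "s5", "S5", "OTHER"])

# Upper-case and remove hyphens so spelling variants collapse to one value
normalized = sources.str.upper().str.replace("-", "", regex=False)
```

Comparing `value_counts()` before and after the change is a quick way to confirm that only the intended values were merged.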

In [None]:
tweets_original_df.columns

# Visualizing the Data

In [None]:
# Plot Distribution
tweets_original_df.Tweet_source.value_counts().plot(kind='pie', autopct='%1.0f%%')

## Label distribution - Positive, Negative, & Neutral

In [None]:
# Check label_id 
tweets_original_df["label_id"].value_counts()

In [None]:
# Bar chart of tweet counts per sentiment

# value_counts() sorts by frequency, so reindex to a fixed label order
# that matches the bar names below
class_count = tweets_original_df['label_id'].value_counts().reindex(['neg', 'pos', 'neu'], fill_value=0)

plt.figure(figsize = (12, 8))
plt.bar(['Negative' , 'Positive' , 'Neutral'], height = class_count.values, color = ['b', 'g', 'r'])
for i, v in enumerate(class_count.values):
    plt.text(i - 0.1, v+300 , str(v))
    
plt.xlabel('Tweet sentiment')
plt.ylabel('Number of tweets')
plt.title('Count of tweets for each sentiment')

#### Country-wise positive, negative, and neutral tweets (map, if possible)

In [None]:
tweets_pos = tweets_original_df[(tweets_original_df.label_id.isin(["pos"]))]
tweets_neu = tweets_original_df[(tweets_original_df.label_id.isin(["neu"]))]
tweets_neg = tweets_original_df[(tweets_original_df.label_id.isin(["neg"]))]

In [None]:
tweets_pos["Country"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue"])

In [None]:
tweets_neg["Country"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue"])

In [None]:
tweets_neu["Country"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue"])

### Convert label_id to numeric as {'neu' : 2, 'pos' : 0, 'neg' : 1}

In [None]:
tweets_original_df.columns

In [None]:
# Create the dictionary 
class_dictionary = {'neu' : 2, 'pos' : 0, 'neg' : 1} 
  
# Add a new column named 'class' holding the numeric label
tweets_original_df['class'] = tweets_original_df['label_id'].map(class_dictionary) 
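
`Series.map` with a dictionary returns NaN for any value that is not a key, so an unexpected label would silently become a missing class. A sketch on made-up labels, where `"xyz"` is a deliberately invalid value:

```python
import pandas as pd

class_dictionary = {"neu": 2, "pos": 0, "neg": 1}
labels = pd.Series(["pos", "neg", "neu", "xyz"])  # "xyz" is a made-up bad label

classes = labels.map(class_dictionary)

# Labels missing from the dictionary map to NaN
unmapped = labels[classes.isna()]
```

Checking `unmapped` before dropping `label_id` helps catch malformed IDs while the original label text is still available.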

In [None]:
#Dropping label-id  column from main dataframe as it has been converted to class column as numeric
tweets_original_df = tweets_original_df.drop(["label_id"], axis=1)

In [None]:
# Check label (class in numeric) distribution - # 2 = Neutral, 0 = Positive  , 1 = Negative
tweets_original_df["class"].value_counts()  

In [None]:
tweets_original_df["class"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue"])

In [None]:
print(tweets_original_df.head(5))

#### Explore negative tweets and their category (column - Tweet-Class_category-Code)

In [None]:
tweets_neg["Tweet-Class_category-Code"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue","green"])

##### Explore distribution of "no. of tweets" (column - retweet_count) for each label

In [None]:
tweets_original_df["retweet_count"].value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "blue"])

# Tweet pattern

### WordCloud of each class

In [None]:
positive_tweets = ' '.join(tweets_original_df[tweets_original_df['class'] == 0]['Tweet'].str.lower())

In [None]:
neutral_tweets = ' '.join(tweets_original_df[tweets_original_df['class'] == 2]['Tweet'].str.lower())

In [None]:
negative_tweets = ' '.join(tweets_original_df[tweets_original_df['class'] == 1]['Tweet'].str.lower())
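
The three cells above build one lower-cased text blob per class. The same result can be sketched with a single `groupby`, shown here on made-up tweets:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "class": [0, 1, 0, 2],
    "Tweet": ["Great phone", "Bad battery", "Love it", "It is a phone"],
})

# One lower-cased text blob per class, indexed by the class value
text_by_class = toy_df.groupby("class")["Tweet"].apply(
    lambda tweets: " ".join(tweets.str.lower()))
```

`text_by_class[0]` then plays the role of `positive_tweets`, and likewise for the other two classes.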

In [None]:
# pip install wordcloud
from wordcloud import WordCloud, STOPWORDS

## POSITIVE Tweets Word Cloud

In [None]:
# Stop words are the most common words in a language, such as "the" or "and".
# They are typically uninformative, so many Natural Language Processing (NLP)
# applications remove them during preprocessing.
wordcloud = WordCloud(stopwords = STOPWORDS, background_color = "white", max_words = 1000).generate(positive_tweets)
plt.figure(figsize = (12, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Positive tweets Wordcloud")

## NEGATIVE Tweets Word Cloud

In [None]:
wordcloud = WordCloud(stopwords = STOPWORDS, background_color = "white", max_words = 1000).generate(negative_tweets)
plt.figure(figsize = (12, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Negative tweets Wordcloud")

## NEUTRAL Tweets Word Cloud

In [None]:
wordcloud = WordCloud(stopwords = STOPWORDS, background_color = "white", max_words = 1000).generate(neutral_tweets)
plt.figure(figsize = (12, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Neutral tweets Wordcloud")
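
A word cloud is a picture of word frequencies after stop-word removal. As a text-only complement, the counts behind such a picture can be sketched with the standard library. The stop-word set and the sentence below are made up for illustration; the notebook itself uses `wordcloud.STOPWORDS`:

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption, not wordcloud.STOPWORDS)
stop_words = {"the", "is", "a", "and", "it"}

text = "the phone is great and the camera is great"
tokens = re.findall(r"[a-z']+", text.lower())

# Count every token that is not a stop word
counts = Counter(t for t in tokens if t not in stop_words)
top_words = counts.most_common(2)
```

Printing `top_words` for each class gives exact numbers for the words that dominate the corresponding cloud.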

In [None]:
#Check null before splitting
columnsWiseMissingValue = find_missing_values_func(tweets_original_df) 
print(columnsWiseMissingValue)

In [None]:
print(colored("Class distribution:", "yellow"))
print(tweets_original_df['class'].value_counts())

# Saving Step 1 Pre-Processing Data

In [None]:
#Save first round cleaned tweets_original_df
tweets_original_df.to_csv(OutputFolder+"/Step1_PreProcessing_Group33_Cleaned_Tweets.csv", index = False)
print(colored("DATA SAVED", "green"))
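
After saving, reading the file back is a cheap way to confirm that the CSV round-trips as expected. A sketch using a temporary folder and a hypothetical file name, so it does not touch `OutputFolder`:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_df = pd.DataFrame({"Tweet": ["great", "awful"], "class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_tweets.csv"  # hypothetical file name
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same rows and columns as the original
same_shape = reloaded.shape == toy_df.shape
```

Passing `index=False`, as the notebook does, keeps the row index out of the file so no extra unnamed column appears on reload.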

# ----DONE----