<a href="https://colab.research.google.com/github/Jeetesh-KumarM/CAPSTONE-PROJECT-3-CLASSIFICATION/blob/main/whole.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Coronavirus Tweet Sentiment Analysis

##### **Project Type** - Classification
##### **Contribution** - Individual


### Import Libraries

In [51]:
# Import Libraries
%matplotlib inline
import datetime as dt
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Sklearn Libraries
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Dataset Loading

In [52]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [53]:
df=pd.read_csv('/content/drive/MyDrive/AlmaBetter/Coronavirus Tweet Sentiment Analysis/Coronavirus Tweets.csv',encoding='latin-1')

### Data Wrangling Code

In [54]:
# Work on a copy so the raw DataFrame stays untouched
df1=df.copy()

In [55]:
df1['Sentiment']=df1['Sentiment'].replace(to_replace=["Extremely Positive", "Extremely Negative"],value=["Positive","Negative"])

### Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, sentiment analysis, text clustering.)

#### 1. Handling Missing Values

In [56]:
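# Handling Missing Values & Missing Value Imputation
# The Location column has null values; drop the affected rows
df1.dropna(inplace=True)
# drop=True discards the old index instead of keeping it as an extra column
df1 = df1.reset_index(drop=True)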

#### 2. Expanding Contractions

Expand contractions such as "it'll" and "would've" into "it will" and "would have".

In [57]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [58]:
# Expand Contraction
import contractions
def cont(x):
  return contractions.fix(x)

In [59]:
df1['Mod_Tweet']=df1['OriginalTweet'].apply(cont)
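
To make the expansion step concrete, here is a minimal standard-library sketch of the idea. It is not how the `contractions` package is implemented; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the lookup table is deliberately tiny.

```python
import re

# Hypothetical, deliberately small lookup table for illustration only;
# the real `contractions` package covers far more forms.
CONTRACTION_MAP = {
    "it'll": "it will",
    "would've": "would have",
    "can't": "cannot",
    "don't": "do not",
}

# One alternation of all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

sample = "It'll be fine, but I would've stayed home and I can't shop."
expanded = expand_contractions(sample)
print(expanded)
```

The replacement is always lowercase in this sketch, which is harmless here because the next step lowercases the whole tweet anyway.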

#### 3. Lower Casing

In [60]:
# Lower Casing
df1['Mod_Tweet']=df1["Mod_Tweet"].str.lower()

#### 4. Removing URLs and Digits

In [61]:
# Remove URLs, digits, and any other character outside the whitelist
import re

# function to remove numbers
def remove_numbers(text):
  # keep only letters, basic punctuation, and whitespace
  # (A-Z, not A-z: the wider range also matches [ \ ] ^ _ `)
  pattern = r'[^a-zA-Z.,!?/:;\"\'\s]'
  return re.sub(pattern, '', text)

def remove_url(text_data):
  # strip everything from "http" up to the next whitespace
  text=re.sub(r"http\S+", "", text_data)
  return remove_numbers(text)

In [62]:
df1['Mod_Tweet']=df1['Mod_Tweet'].apply(remove_url)
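
One pitfall with this kind of whitelist is that the range `A-z` (lowercase `z`) is wider than `A-Z`: it also covers the six ASCII characters that sit between `Z` and `a`. The sketch below shows the difference on a made-up sample string; `strip_urls_and_digits` is a hypothetical helper name, not part of the notebook.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
# Keep ASCII letters, basic punctuation, and whitespace; drop everything else.
KEEP_PATTERN = re.compile(r"[^a-zA-Z.,!?/:;\"'\s]")

def strip_urls_and_digits(text):
    """Remove URLs first, then every character outside the whitelist."""
    return KEEP_PATTERN.sub("", URL_PATTERN.sub("", text))

sample = "covid_19 cases up 20% [source] https://t.co/abc123 stay safe!"
cleaned = strip_urls_and_digits(sample)
print(cleaned.split())

# The characters that A-z matches in addition to A-Z and a-z.
between = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
print(between)

# Same input, two character classes: the loose one lets '_' and brackets through.
loose = re.sub(r"[^a-zA-z\s]", "", "covid_19 [x]")
strict = re.sub(r"[^a-zA-Z\s]", "", "covid_19 [x]")
print(repr(loose), repr(strict))
```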

#### 5. Removing Punctuations

In [63]:
# Remove Punctuations
import string
punct_list = list(string.punctuation)

# function to remove special characters
def remove_special_characters(text):
    # replace anything that is not a letter, digit, or listed punctuation mark with a space
    pat = r'[^a-zA-Z0-9.,!?/:;\"\']'
    return re.sub(pat, ' ', text)

def remove_punctuation(text):
    # replace every punctuation character with a space
    for punc in punct_list:
        text = text.replace(punc, ' ')
    return remove_special_characters(text.strip())

In [64]:
df1['Mod_Tweet']=df1['Mod_Tweet'].apply(remove_punctuation)

#### 6. Removing Stopwords & Removing White spaces

In [65]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [66]:
# Remove Stopwords and White spaces
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    # split() discards all extra whitespace and join() re-inserts single spaces,
    # so no separate whitespace-removal step is needed
    return " ".join([word for word in str(text).split() if word not in stop_words])

In [67]:
df1['Mod_Tweet']=df1['Mod_Tweet'].apply(remove_stopwords)
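
The split-then-join idiom is what makes a separate whitespace pass unnecessary. A small sketch, assuming a made-up mini stop-word set (`STOP_WORDS` is hypothetical; NLTK's English list is much longer):

```python
# Hypothetical mini stop-word set, for illustration only.
STOP_WORDS = {"the", "is", "a", "to", "and", "in"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop stop words; split()/join() also collapses runs of whitespace."""
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "  the   store is   out of   pasta and  rice "
result = remove_stopwords(sample)
print(repr(result))
```

Leading, trailing, and repeated spaces all disappear because `str.split()` with no argument splits on any run of whitespace.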

In [68]:
# Remove White spaces
# Already handled by remove_stopwords above. A standalone alternative would be:
# def remove_whitespaces(text):
#   return re.sub(' +', ' ', text)

#### 7. Tokenization

In [69]:
#Tokenization
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def token(y):
  return word_tokenize(y)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [70]:
df1['Mod_Tweet']=df1['Mod_Tweet'].apply(token)

#### 8. Text Normalization

In [71]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [72]:
# Stemming
# Stem each token and join the results back into a single string
def stemming(text):
    text = [stemmer.stem(word) for word in text]
    return " ".join(text)

In [73]:
df1['Mod_Tweet']=df1['Mod_Tweet'].apply(stemming)

### Model Implementation - Tuned Logistic Regression

In [74]:
X=df1.Mod_Tweet
y=df1.Sentiment
# Hold out 20% of the tweets for testing
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=50)
# The default lbfgs solver does not support the l1 penalty,
# so use saga, which handles both l1 and l2
LR = LogisticRegression(solver='saga', max_iter=1000)
parameters = dict(penalty=['l1', 'l2'],C=[10, 1.0, 0.1, 0.01])
tvec = TfidfVectorizer()
# Hyperparameter tuning with GridSearchCV
logreg_Gcv=GridSearchCV(LR,parameters,cv=15)

model = Pipeline([('vectorizer',tvec),('classifier',logreg_Gcv)])
# Fit the Algorithm
model.fit(X_train, y_train)
# Predict on the model
y2_pred = model.predict(X_test)
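
The cell above stops at prediction, and it fits the vectorizer once before the grid search runs its internal folds. One alternative arrangement, sketched below on made-up toy data, wraps the whole pipeline in `GridSearchCV` (so the TF-IDF vocabulary is refit inside each fold) and then scores the held-out predictions with the metrics imported at the top of the notebook. The texts, labels, grid values, and `cv=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy, made-up training data (four short texts per class).
train_texts = [
    "great help from store staff", "love the kind volunteers",
    "good news on supplies", "happy to see full shelves",
    "store open until nine", "delivery slot on monday",
    "prices listed online", "queue starts at the door",
    "terrible panic buying", "awful empty shelves",
    "bad news on prices", "sad to see hoarding",
]
train_labels = ["Positive"] * 4 + ["Neutral"] * 4 + ["Negative"] * 4

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"classifier__C": [0.1, 1.0, 10]}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(train_texts, train_labels)

# Toy held-out examples with the labels we would hope to see.
test_texts = ["good kind staff", "awful bad prices", "delivery on monday"]
test_labels = ["Positive", "Negative", "Neutral"]
pred = search.predict(test_texts)

print("best parameters:", search.best_params_)
print("accuracy:", accuracy_score(test_labels, pred))
print(classification_report(test_labels, pred, zero_division=0))
```

In the notebook itself the same calls would take `y_test` and `y2_pred` in place of the toy lists.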

#### 1. Save the best-performing ML model in pickle or joblib format for deployment.


In [75]:
# Save the File
import pickle
# The with block closes the file even if pickling fails
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

#### 2. Reload the saved model and predict on unseen data as a sanity check.


In [76]:
# Load the File and predict unseen data.
with open('model.pkl', 'rb') as f:
    pickled_model = pickle.load(f)
pickled_model.predict(['I had a good day','I had a normal day','I had a bad day'])

array(['Positive', 'Neutral', 'Negative'], dtype=object)
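
A slightly stricter sanity check is to confirm that the reloaded model predicts exactly what the in-memory model predicts. The sketch below does this on made-up toy data with a plain TF-IDF plus logistic regression pipeline; the texts, labels, and temporary file location are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy, made-up training data.
texts = ["good day", "great news", "bad day", "awful news", "normal day", "usual news"]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

unseen = ["I had a good day", "I had a normal day", "I had a bad day"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Save, then reload, using context managers so the files are closed.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The round trip should not change any prediction.
same = np.array_equal(model.predict(unseen), restored.predict(unseen))
print("predictions identical after reload:", same)
```

Note that a pickled scikit-learn model should be loaded with the same library version that saved it, and pickle files should only be loaded from trusted sources.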