# Description
## Pandemic Tweet Challenge
The Google Developers Students Club of IIT Indore presents the Pandemic Tweet Challenge, a Natural Language Processing task in which each Covid tweet is classified into one of five sentiment categories: extremely positive, positive, neutral, negative, or extremely negative.

### Problem Statement:
Your job is to identify the sentiment of a given Covid tweet and assign it one of the class labels: extremely positive, positive, neutral, negative, or extremely negative.

You are provided a train dataset, which contains the tweets and their corresponding labeled sentiments. The tweets have been pulled from Twitter. The names and usernames have been given codes to avoid any privacy concerns. Your job is to perform Text Classification on the provided data.

### Dataset:
You are provided two splits of data - train and test. Navigate to the Data Section for more details regarding these.

### Submission:
There is a sample submission CSV attached. This file has two columns. The first column contains Original tweets. In the second column, you have to predict the sentiment class label (extremely positive, positive, neutral, negative, extremely negative) for each of the tweets in the first column. You have to submit your file in this format only.

Explaining your approach in your submission notebook is mandatory. The code submitted must be able to replicate your final submission file.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/pandemic-tweet-challenge/submission_.csv
/kaggle/input/pandemic-tweet-challenge/Pandemic_NLP_test_.csv
/kaggle/input/pandemic-tweet-challenge/Pandemic_NLP_train.csv


# Loading, EDA and preprocessing

In [2]:
train_df = pd.read_csv("/kaggle/input/pandemic-tweet-challenge/Pandemic_NLP_train.csv", encoding="ISO-8859-1")
print(train_df.shape[0])
train_df.head()

41157


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [3]:
train_df['Sentiment'].unique()

array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
       'Extremely Positive'], dtype=object)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       41157 non-null  int64 
 1   ScreenName     41157 non-null  int64 
 2   Location       32567 non-null  object
 3   TweetAt        41157 non-null  object
 4   OriginalTweet  41157 non-null  object
 5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB


In [5]:
train_df['Location'] = train_df['Location'].fillna('unknown')

In [6]:
test_df = pd.read_csv("/kaggle/input/pandemic-tweet-challenge/Pandemic_NLP_test_.csv")
print(test_df.shape[0])
test_df.head()

3798


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet
0,1956,46908,0.9921875,13-03-2020,COVID-19 might be presenting online shopping p...
1,711,45663,210.0,12/3/2020,My right wing coo coo father in law was tellin...
2,1346,46298,310.0,13-03-2020,I cannot decide if I am the smartest person in...
3,2204,47156,505.0,14-03-2020,Why are people stock piling what s wrong wit...
4,1265,46217,21113.0,13-03-2020,Show me where the eggs are. That's all I need ...


In [7]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       3798 non-null   int64 
 1   ScreenName     3798 non-null   int64 
 2   Location       2964 non-null   object
 3   TweetAt        3798 non-null   object
 4   OriginalTweet  3798 non-null   object
dtypes: int64(2), object(3)
memory usage: 148.5+ KB


In [8]:
test_df['Location'] = test_df['Location'].fillna('unknown')

In [9]:
submission_df = pd.read_csv("/kaggle/input/pandemic-tweet-challenge/submission_.csv", encoding="ISO-8859-1")
submission_df.head()

Unnamed: 0,OriginalTweet,Sentiment
0,TRENDING: New Yorkers encounter empty supermar...,
1,When I couldn't find hand sanitizer at Fred Me...,
2,Find out how you can protect yourself and love...,
3,#Panic buying hits #NewYork City as anxious sh...,
4,#toiletpaper #dunnypaper #coronavirus #coronav...,


In [10]:
submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   OriginalTweet  3798 non-null   object 
 1   Sentiment      0 non-null      float64
dtypes: float64(1), object(1)
memory usage: 59.5+ KB


In [11]:
import re
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words('english') for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\S+', '', text)  # Remove mentions
    text = re.sub(r'#', '', text)  # Remove the '#' symbol but keep the hashtag word
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # Remove stopwords
    return text

train_df['cleaned_tweet'] = train_df['OriginalTweet'].apply(clean_text)
test_df['cleaned_tweet'] = test_df['OriginalTweet'].apply(clean_text)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Conclusion
The data consists of 41157 tweets in the training dataset and 3798 in the test dataset. The tweets contain URLs, mentions, '#' symbols, numbers, punctuation and stopwords, which we strip out during cleaning. Some tweets have no location, so we fill that column with the value 'unknown'.

The submission dataframe contains the tweets that we have to classify as 'Neutral', 'Positive', 'Extremely Negative', 'Negative', or 'Extremely Positive', the same labels used in the training dataframe.
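
To make the cleaning steps concrete, here is a minimal, self-contained sketch that applies the same sequence of substitutions to a made-up tweet. The small `DEMO_STOP_WORDS` set, the `clean_text_demo` name and the sample text are assumptions for illustration only; the notebook itself uses NLTK's full English stopword list.

```python
import re

# Hypothetical mini stopword set for the demo; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "is", "at", "to", "a", "of", "in"}

def clean_text_demo(text):
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'@\S+', '', text)      # drop mentions
    text = re.sub(r'#', '', text)         # drop '#', keep the tag word
    text = re.sub(r'\d+', '', text)       # drop digits
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation
    text = text.lower()
    return ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)

# Made-up tweet, not taken from the dataset.
sample = "@user1 Panic buying at the #supermarket is up 40%! https://t.co/abc"
cleaned = clean_text_demo(sample)
print(cleaned)  # panic buying supermarket up
```

Note that the hashtag word itself survives: only the `#` character is removed, so a tag such as `#supermarket` still contributes a token to the TF-IDF features.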

# Defining and training the model

In [12]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

vectorizer = TfidfVectorizer(max_df=0.7)
X = vectorizer.fit_transform(train_df['cleaned_tweet'])
X_test = vectorizer.transform(test_df['cleaned_tweet'])
y = train_df['Sentiment']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

svm_model = LinearSVC(random_state=42)
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_val)
print(classification_report(y_val, y_pred))

                    precision    recall  f1-score   support

Extremely Negative       0.60      0.58      0.59      1056
Extremely Positive       0.63      0.64      0.63      1330
          Negative       0.50      0.46      0.48      2006
           Neutral       0.59      0.69      0.63      1553
          Positive       0.51      0.49      0.50      2287

          accuracy                           0.55      8232
         macro avg       0.56      0.57      0.57      8232
      weighted avg       0.55      0.55      0.55      8232
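
In the cell above, the vectorizer is fitted on all training tweets before the train/validation split, so the validation tweets contribute to the vocabulary and IDF weights. One possible way to keep the two apart is to wrap the vectorizer and the classifier in a scikit-learn pipeline and fit it on the training split only. The sketch below uses made-up toy tweets and labels, and omits `max_df` because it is not meaningful on such a tiny corpus; it illustrates the pattern and is not a replacement for the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already-cleaned tweets (illustration only, not from the dataset).
texts = [
    "great news stores restocked", "love helpful staff", "happy found sanitizer",
    "thankful delivery arrived", "empty shelves terrible", "panic buying awful",
    "angry prices rising", "scared shortage worse",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Split the raw text first, so the validation tweets stay unseen.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on the training split only.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=42))
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
print(list(zip(X_val, val_pred)))
```

A fitted pipeline also accepts raw (cleaned) strings directly in `predict`, which removes the need to call `vectorizer.transform` separately at prediction time.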



# Making predictions and creating the submission

In [13]:
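# Clean the submission tweets the same way as the training tweets before vectorizing
X_submission = vectorizer.transform(submission_df['OriginalTweet'].apply(clean_text))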
submission_predictions = svm_model.predict(X_submission)
submission_df['Sentiment'] = submission_predictions
submission_df.to_csv('submission.csv', index=False)
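
Before uploading, it can help to confirm that the file matches the required format: two columns, and one of the five sentiment labels on every row. The following is a minimal sketch; `check_submission` is a hypothetical helper, and the small frame stands in for the real `submission_df`. The label spellings are assumed to match those in the training data's `Sentiment` column.

```python
import pandas as pd

# Label spellings as they appear in the training data's Sentiment column.
EXPECTED_LABELS = {
    "Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive",
}

def check_submission(df):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(df.columns) != ["OriginalTweet", "Sentiment"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df["Sentiment"].isna().any():
        problems.append("some rows have no predicted label")
    unknown = set(df["Sentiment"].dropna()) - EXPECTED_LABELS
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    return problems

# Stand-in for the real submission_df (made-up rows with two deliberate mistakes).
demo = pd.DataFrame({
    "OriginalTweet": ["tweet one", "tweet two", "tweet three"],
    "Sentiment": ["Positive", "neutral", None],
})
issues = check_submission(demo)
print(issues)
```

In the notebook, the same function could be called on `submission_df` once the predictions have been assigned.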