
# 🌟 Decoding Digital Emotions: A Journey into Social Media Sentiment Analysis

## 📊 Project Overview

Welcome to the cutting edge of digital emotion decryption! Our mission is to dive deep into the ocean of social media data, extracting, analyzing, and interpreting user sentiments. We're building a powerful sentiment analysis model that classifies social media posts as positive, negative, or neutral, transforming raw text into actionable insights for brand reputation management and market research.

## 🧠 The Brain Behind the Magic

### 1. Data Acquisition and Preprocessing

We're tapping into the Sentiment140 dataset, a treasure trove of 1.6 million tweets, perfect for training our sentiment detection AI.

```python
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Kaggle dataset integration
!pip install kaggle
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d kazanova/sentiment140

# Loading our digital gold
column_names=['target','ids','date','flag','user','text']
df = pd.read_csv('training.1600000.processed.noemoticon.csv', 
                 names=column_names, encoding='latin-1')
```
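The `names=column_names` argument matters because the Sentiment140 CSV has no header row. Without it, pandas promotes the first tweet to column labels and the frame ends up one row short. Here is a minimal sketch of the difference, assuming a two-row in-memory CSV as a stand-in for the real file (the rows are taken from the dataset preview; the `io.StringIO` buffer is only for illustration):

```python
import io

import pandas as pd

# Two header-less rows in the Sentiment140 column layout (stand-in for the real file).
raw = (
    "0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,"
    "my whole body feels itchy and like its on fire\n"
    "0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,"
    "@Kwesidei not the whole crew\n"
)
column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']

# Default behaviour: the first data row is consumed as the header.
no_names = pd.read_csv(io.StringIO(raw))

# With explicit names, every row is kept as data.
with_names = pd.read_csv(io.StringIO(raw), names=column_names)

print(len(no_names), list(no_names.columns)[:2])
print(len(with_names), list(with_names.columns))
```

If the row count after loading is one less than expected, a missing `names=` (or `header=None`) is a likely cause.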

### 2. Data Exploration and Cleaning

Before unleashing our AI, we need to understand our data. We check for missing values and look at how the `target` labels are distributed (in Sentiment140, 0 marks a negative tweet and 4 a positive one).

```python
df.isnull().sum()
df['target'].value_counts()
```

### 3. Text Preprocessing

Raw text is like a rough diamond - we need to polish it. We're using NLTK for removing stopwords and stemming, crucial steps in preparing our text for analysis.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
```

### 4. Feature Extraction

We're using TF-IDF (Term Frequency-Inverse Document Frequency) to convert our text into a format our model can understand. This technique helps us capture the importance of words in the context of our entire dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
```
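To make the idea concrete, here is a small pure-Python sketch of a TF-IDF weighting. It is an illustration, not the library's implementation: the smoothed IDF formula and the L2 normalisation follow what scikit-learn documents as the `TfidfVectorizer` defaults, while whitespace tokenisation and the three sample documents are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        # Scale the vector to unit length (guard against an empty document).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["whole bodi feel itchi like fire", "feel like cri", "feel great today"]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1]))
```

In the first document, `feel` (present in all three documents) gets a lower weight than `like` (two documents), which in turn gets a lower weight than `fire` (one document).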

### 5. Model Building and Evaluation

We're employing Logistic Regression for its simplicity and effectiveness in text classification tasks. Our model will be trained on a portion of the data and tested on the rest to ensure its real-world performance.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```
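Testing on held-out data boils down to comparing predicted labels with true labels. The sketch below shows what an accuracy figure measures, assuming made-up label lists in the Sentiment140 encoding; the `accuracy` helper is a hypothetical stand-in written for illustration, not scikit-learn's `accuracy_score` itself.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 0 = negative, 4 = positive.
y_true = [0, 0, 4, 4, 4]
y_pred = [0, 4, 4, 4, 0]
print(accuracy(y_true, y_pred))  # 3 of 5 predictions match
```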

## 🎯 Why This Matters

In the age of social media, understanding public sentiment is like having a superpower. Our project isn't just about classification - it's about giving businesses the ability to:

- 📈 Monitor brand health in real-time
- 🎯 Tailor marketing strategies based on public mood
- 🛠 Improve products and services through direct feedback
- 🚨 Detect and address PR crises before they escalate

## 🔮 Looking Ahead

As we refine our model, we're not just improving accuracy - we're paving the way for more nuanced sentiment analysis. Future iterations could include:

- 😊😐😢 Multi-class sentiment classification
- 📊 Sentiment trend analysis over time
- 🗺 Geospatial sentiment mapping

Join us on this exciting journey as we turn the cacophony of social media into a symphony of insights!

In [1]:
import numpy as np

In [2]:
import matplotlib.pyplot as plt

In [3]:
import pandas as pd

In [4]:
! pip install kaggle






In [5]:
import os
import json

os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)

!cp kaggle.json ~/.kaggle/kaggle.json

!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d kazanova/sentiment140


import pandas as pd

df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1')
print(df.head())

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '~/.kaggle/kaggle.json': No such file or directory


Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)
   0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY _TheSpecialOne_  \
0  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   scotthamilton   
1  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY        mattycus   
2  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY         ElleCTF   
3  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          Karoli   
4  0  1467811372  Mon Apr 06 22:20:00 PDT 2009  NO_QUERY        joy_wolf   

  @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  
0  is upset that he can't update his Facebook by ...                                                                   
1  @Kenichan I dived many times for the ball. Man...                                                                  

In [6]:
df

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [7]:
import re

In [8]:
!pip install nltk






In [9]:
from nltk.corpus import stopwords

In [10]:
from nltk.stem.porter import PorterStemmer

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [13]:
from sklearn.metrics import accuracy_score

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yaswa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
column_names=['target','ids','date','flag','user','text']

In [17]:
import pandas as pd
column_names=['target','ids','date','flag','user','text']
# Adjust the filename if necessary
df = pd.read_csv('training.1600000.processed.noemoticon.csv',names=column_names, encoding='latin-1')


In [18]:
df

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [19]:
df.isnull().sum()

target    0
ids       0
date      0
flag      0
user      0
text      0
dtype: int64

In [20]:
df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [21]:
a=10


# 🎭 Sentiment Scoring and Text Preprocessing: Unveiling the Emotions in Tweets

## 💡 Sentiment Analysis: Beyond Binary Classification

We score every tweet with NLTK's VADER analyzer. Its `compound` score runs from -1 (most negative) to +1 (most positive), which gives us a graded signal on top of the dataset's binary labels:

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def get_sentiment_score(text):
    return sia.polarity_scores(text)['compound']

df['sentiment_score'] = df['text'].apply(get_sentiment_score)
```

### 🎨 Introducing Neutral Sentiment

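We've expanded our classification to include a neutral category. A tweet whose compound score falls within ±0.05 is relabeled as neutral (2); every other tweet keeps its original Sentiment140 label: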

```python
NEUTRAL_THRESHOLD = 0.05
df['new_target'] = df['target'].copy()
df.loc[(df['sentiment_score'] >= -NEUTRAL_THRESHOLD) & (df['sentiment_score'] <= NEUTRAL_THRESHOLD), 'new_target'] = 2
df['new_target'] = df['new_target'].map({0: 0, 2: 2, 4: 4})
```
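The same rule, written as a plain function, makes the behaviour easier to check by hand. This is a sketch, not the notebook's code: `relabel` is a hypothetical helper, and the four (label, score) pairs are taken from the dataframe preview shown later in this notebook.

```python
NEUTRAL_THRESHOLD = 0.05


def relabel(original_target, compound_score, threshold=NEUTRAL_THRESHOLD):
    """Return 2 (neutral) when the score is inside the band, else the original label."""
    if -threshold <= compound_score <= threshold:
        return 2
    return original_target


rows = [
    (0, -0.0173),  # near-zero score: becomes neutral
    (0, -0.7500),  # clearly negative: stays 0
    (0, 0.4939),   # positive score but outside the band: original label 0 is kept
    (4, 0.5423),   # clearly positive: stays 4
]
labels = [relabel(target, score) for target, score in rows]
print(labels)  # [2, 0, 0, 4]
```

Note the third row: the VADER score only decides whether a tweet is neutral. Outside the band, the original label wins even when the score points the other way.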

### 📊 Distribution of Sentiments

After relabeling, the original 800,000 / 800,000 negative/positive split becomes three classes of unequal size:

```
new_target
0    584568 (Negative)
4    571670 (Positive)
2    443762 (Neutral)
```

## 🧹 Text Preprocessing: Cleaning and Standardizing

### 🌱 Stemming: Back to the Roots

We're using the Porter Stemmer to reduce words to their root form:

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
port_stem = PorterStemmer()

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content
```

### ⚡ Performance Optimization

We apply the stemming function to the entire dataset and time the run. In this notebook, the pass over all 1.6 million rows took 6114.01 seconds (about 102 minutes):

```python
import time

start_time = time.time()
df['stemmed_content'] = df['text'].apply(stemming)
end_time = time.time()
execution_time = end_time - start_time
print(f"Time taken to process 1,600,000 rows: {execution_time:.2f} seconds")
```
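One likely cost in `stemming` is that `stopwords.words('english')` is evaluated again for every word, and membership in the returned list is a linear scan. The sketch below shows a variant that builds the stopword set once. `make_preprocessor` is a hypothetical helper, and the demo passes a four-word stopword list and an identity stemmer so that it runs without NLTK data; in the notebook one would pass `stopwords.words('english')` and `PorterStemmer().stem` instead.

```python
import re
from typing import Callable, Iterable


def make_preprocessor(stop_words: Iterable[str], stem: Callable[[str], str]):
    """Build a cleaning function that reuses one stopword set and one compiled regex."""
    stop_set = frozenset(stop_words)          # built once; constant-time membership
    non_letters = re.compile(r'[^a-zA-Z]')    # compiled once

    def preprocess(text: str) -> str:
        tokens = non_letters.sub(' ', text).lower().split()
        return ' '.join(stem(tok) for tok in tokens if tok not in stop_set)

    return preprocess


# Demo with a tiny stopword list and a no-op stemmer (assumptions for illustration).
demo = make_preprocessor(['my', 'and', 'its', 'on'], stem=lambda word: word)
print(demo("my whole body feels itchy and like its on fire"))
```

The returned function can be passed to `df['text'].apply(...)` in place of `stemming`. How much time it saves depends on the machine, so it is worth timing on a small sample first.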

## 📊 Data Insights

After preprocessing, we gain valuable insights into our dataset:

- Shape of the dataframe: 1,600,000 rows x 8 columns
- Memory usage: 661.18 MB, measured with `memory_usage(deep=True)`
- Null values: none in `stemmed_content` while the data is in memory; 495 appear once the saved CSV is reloaded

## 🚀 Why This Matters

1. **Nuanced Sentiment Analysis**: By introducing a neutral category, we capture the full spectrum of emotions in social media.
2. **Repeatable Text Processing**: One stemming function cleans all 1.6 million tweets in a single pass (about 102 minutes here), and the result is saved to `stemmed_tweets.csv` so later steps can skip the wait.
3. **Data Quality**: Rigorous preprocessing ensures our model works with clean, standardized text data.

## 🔮 Looking Ahead

- Explore advanced NLP techniques like lemmatization or word embeddings
- Implement multi-language support for global sentiment analysis
- Develop real-time streaming capabilities for instant sentiment tracking

This refined approach to sentiment analysis and text preprocessing sets the stage for highly accurate and insightful social media sentiment analysis, providing businesses with a powerful tool for understanding public opinion in the digital age.

In [22]:
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

def get_sentiment_score(text):
    return sia.polarity_scores(text)['compound']

df['sentiment_score'] = df['text'].apply(get_sentiment_score)

NEUTRAL_THRESHOLD = 0.05

df['new_target'] = df['target'].copy()
df.loc[(df['sentiment_score'] >= -NEUTRAL_THRESHOLD) & (df['sentiment_score'] <= NEUTRAL_THRESHOLD), 'new_target'] = 2

df['new_target'] = df['new_target'].map({0: 0, 2: 2, 4: 4})

print(df['new_target'].value_counts())


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\yaswa\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


new_target
0    584568
4    571670
2    443762
Name: count, dtype: int64


In [23]:
df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [24]:
df

Unnamed: 0,target,ids,date,flag,user,text,sentiment_score,new_target
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",-0.0173,2
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,-0.7500,0
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,0.4939,0
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,-0.2500,0
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",-0.6597,0
...,...,...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,0.5423,4
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,0.4376,4
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,0.3612,4
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,0.6784,4


In [25]:
df['target'] = df['new_target']

In [26]:
df

Unnamed: 0,target,ids,date,flag,user,text,sentiment_score,new_target
0,2,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",-0.0173,2
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,-0.7500,0
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,0.4939,0
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,-0.2500,0
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",-0.6597,0
...,...,...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,0.5423,4
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,0.4376,4
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,0.3612,4
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,0.6784,4


In [27]:
df['target'].value_counts()

target
0    584568
4    571670
2    443762
Name: count, dtype: int64

In [28]:
df = df.drop('new_target', axis=1)

print(df.columns)
print(df['target'].value_counts())

Index(['target', 'ids', 'date', 'flag', 'user', 'text', 'sentiment_score'], dtype='object')
target
0    584568
4    571670
2    443762
Name: count, dtype: int64


In [29]:
df['target'].value_counts()

target
0    584568
4    571670
2    443762
Name: count, dtype: int64

In [30]:
df['target']=df['target'].replace({
    0:-1,
    4:1,
    2:0
})

In [31]:
df

Unnamed: 0,target,ids,date,flag,user,text,sentiment_score
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",-0.0173
1,-1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,-0.7500
2,-1,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,0.4939
3,-1,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,-0.2500
4,-1,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",-0.6597
...,...,...,...,...,...,...,...
1599995,1,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,0.5423
1599996,1,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,0.4376
1599997,1,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,0.3612
1599998,1,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,0.6784


- `-1`: negative tweet
- `0`: neutral tweet
- `1`: positive tweet



# Stemming

Stemming is the process of reducing a word to its root form.

Example: actor, acting, actress → act

In [32]:
port_stem=PorterStemmer()

In [33]:
a=20
a

20

In [51]:
import re
import nltk
import time
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')

port_stem = PorterStemmer()

def stemming(content):
    
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

df_sample = df.head(100)

start_time = time.time()
df_sample['stemmed_content'] = df_sample['text'].apply(stemming)
end_time = time.time()

execution_time = end_time - start_time
print(f"Time taken to process 100 rows: {execution_time:.4f} seconds")

print(df_sample[['text', 'stemmed_content']].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yaswa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Time taken to process 100 rows: 0.5224 seconds
                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                     stemmed_content  
0  switchfoot http twitpic com zl awww bummer sho...  
1  upset updat facebook text might cri result sch...  
2  kenichan dive mani time ball manag save rest g...  
3                    whole bodi feel itchi like fire  
4                      nationwideclass behav mad see  


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sample['stemmed_content'] = df_sample['text'].apply(stemming)
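
The warning above appears because `df.head(100)` returns a slice of `df`, and pandas cannot tell whether the assignment is meant to reach the original frame. Taking an explicit copy makes the intent clear. Here is a minimal sketch on a toy frame (the two-row frame and the `.str.lower()` placeholder are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'text': ['my whole body feels itchy and like its on fire',
                            '@Kwesidei not the whole crew']})

# An explicit copy owns its data, so adding a column cannot affect df.
df_sample = df.head(1).copy()
df_sample['stemmed_content'] = df_sample['text'].str.lower()  # placeholder for .apply(stemming)

print(df_sample.columns.tolist())
print(df.columns.tolist())
```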


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 7 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   target           1600000 non-null  int64  
 1   ids              1600000 non-null  int64  
 2   date             1600000 non-null  object 
 3   flag             1600000 non-null  object 
 4   user             1600000 non-null  object 
 5   text             1600000 non-null  object 
 6   sentiment_score  1600000 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 85.4+ MB


In [36]:
df.describe()

Unnamed: 0,target,ids,sentiment_score
count,1600000.0,1600000.0,1600000.0
mean,-0.00806125,1998818000.0,0.1411054
std,0.8500495,193576100.0,0.4572251
min,-1.0,1467810000.0,-0.9985
25%,-1.0,1956916000.0,-0.0772
50%,0.0,2002102000.0,0.0
75%,1.0,2177059000.0,0.5267
max,1.0,2329206000.0,0.9987


In [38]:
import re
import nltk
import time
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')

port_stem = PorterStemmer()

def stemming(content):
    
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
   
    stemmed_content = stemmed_content.lower().split()
   
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
    
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

start_time = time.time()

df['stemmed_content'] = df['text'].apply(stemming)

end_time = time.time()

execution_time = end_time - start_time

print(f"Time taken to process 1,600,000 rows: {execution_time:.2f} seconds")

df.to_csv('stemmed_tweets.csv', index=False)

print("\nSample of processed data:")
print(df[['text', 'stemmed_content']].head())

print(f"\nShape of the dataframe: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

null_count = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content: {null_count}")

memory_usage = df.memory_usage(deep=True).sum() / 1024**2  # in MB
print(f"\nMemory usage of the dataframe: {memory_usage:.2f} MB")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yaswa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Time taken to process 1,600,000 rows: 6114.01 seconds

Sample of processed data:
                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                     stemmed_content  
0  switchfoot http twitpic com zl awww bummer sho...  
1  upset updat facebook text might cri result sch...  
2  kenichan dive mani time ball manag save rest g...  
3                    whole bodi feel itchi like fire  
4                      nationwideclass behav mad see  

Shape of the dataframe: (1600000, 8)

Column names:
['target', 'ids', 'date', 'flag', 'user', 'text', 'sentiment_score', 'stemmed_content']

Number of null values in stemmed_content: 0

Memory usage of the dataframe: 661.18 MB


# 🧠 Data Preparation and Feature Extraction: Transforming Text into Machine-Readable Insights

## 📊 Loading and Verifying Processed Data

We begin by loading our previously processed data, ensuring data integrity and consistency:

```python
import os
import pandas as pd

def load_processed_data(file_path):
    if os.path.exists(file_path):
        print(f"Loading processed data from {file_path}")
        df = pd.read_csv(file_path)
        return df
    else:
        raise FileNotFoundError(f"Processed data file not found: {file_path}")

processed_file = 'stemmed_tweets.csv'
df = load_processed_data(processed_file)
```

### 🔍 Data Quality Check

We perform rigorous checks to ensure our data is ready for analysis:

```python
print(f"\nShape of the dataframe: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

null_count = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content: {null_count}")
```

### 🧹 Handling Missing Data

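Reloading the CSV reports 495 null values in `stemmed_content`, although there were none before saving. We replace them with empty strings so the vectorizer receives text for every row: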

```python
df['stemmed_content'] = df['stemmed_content'].fillna('')
null_count_after = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content after handling: {null_count_after}")
```
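Nulls can show up after a reload even when none existed before saving: `to_csv` writes an empty string as an empty field, and `read_csv` parses empty fields as NaN by default. A plausible source of empty strings is a tweet that contains only stopwords or non-letter characters. The sketch below reproduces the effect on a toy frame (the two rows and the in-memory buffer are assumptions for illustration):

```python
import io

import pandas as pd

before = pd.DataFrame({
    'text': ['12:30 !!!', 'feeling great today'],
    # Hand-written stand-ins: the first tweet has nothing left after cleaning.
    'stemmed_content': ['', 'feel great today'],
})
nulls_before_save = int(before['stemmed_content'].isnull().sum())

# Round-trip through CSV, as the notebook does with stemmed_tweets.csv.
buffer = io.StringIO()
before.to_csv(buffer, index=False)
buffer.seek(0)
after = pd.read_csv(buffer)  # the empty field comes back as NaN
nulls_after_reload = int(after['stemmed_content'].isnull().sum())

# Restore empty strings so downstream text tools see a string in every row.
after['stemmed_content'] = after['stemmed_content'].fillna('')
nulls_after_fill = int(after['stemmed_content'].isnull().sum())

print(nulls_before_save, nulls_after_reload, nulls_after_fill)
```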

## 🚀 Feature Extraction: TF-IDF Vectorization

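We split the data into training, validation, and test sets, fit a TF-IDF vectorizer on the training text only, and reuse that fitted vectorizer to transform the test text: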

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = df['stemmed_content']
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1111, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

print("\nShape of vectorized training data:", X_train_vectorized.shape)
print("Shape of vectorized testing data:", X_test_vectorized.shape)
```
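The second `test_size` of 0.1111 looks odd until the two splits are chained: it is applied to the 90% that remains after the first split, so it carves out roughly 10% of the full dataset. A short arithmetic sketch, where `split_fractions` is a hypothetical helper written for illustration:

```python
def split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) fractions of the full dataset for two chained splits."""
    test = test_size
    remainder = 1.0 - test_size
    val = remainder * val_size_of_remainder
    train = remainder - val
    return train, val, test


train, val, test = split_fractions(0.1, 0.1111)
print(f"train={train:.4f} val={val:.4f} test={test:.4f}")
```

The result is close to an 80/10/10 split. The validation part (`X_val`, `y_val`) is not used in the code shown here, but it is available for tuning without touching the test set.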

## 🎯 Model Training and Evaluation

We employ Logistic Regression for its efficiency in text classification:

```python
from sklearn.linear_model import LogisticRegression

log = LogisticRegression()
log.fit(X_train_vectorized, y_train)

accuracy = log.score(X_test_vectorized, y_test)
print(f"\nModel Accuracy: {accuracy:.4f}")
```
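With default settings, `LogisticRegression` stops after at most 100 solver iterations, and on a large sparse matrix it can stop before converging; a later cell in this notebook ends with exactly that message. Below is a minimal sketch of raising the cap with `max_iter`, assuming a tiny hand-written corpus in the notebook's -1 / 0 / 1 label scheme (the texts, labels, and the value 1000 are illustrative choices, not tuned settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hand-written corpus of already-stemmed text (assumption for illustration).
texts = ["love great day", "happi birthday love", "upset cri sad",
         "hate sad day", "go work today", "watch tv today"]
labels = [1, 1, -1, -1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# max_iter raises the solver's iteration cap above the default of 100.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

print(model.classes_)  # the label values the model learned
print(model.n_iter_)   # iterations the solver actually used
```

Comparing `n_iter_` with `max_iter` is a quick way to see whether the solver ran out of iterations.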

## 💡 Key Insights

1. **Robust Data Handling**: Our pipeline efficiently manages a large dataset of 1.6 million tweets.
2. **Effective Feature Extraction**: TF-IDF vectorization captures the essence of our text data in 5000 features.
3. **Solid Initial Performance**: Our Logistic Regression model achieves an accuracy of 0.7848, providing a strong baseline for sentiment classification.

## 🚀 Why This Matters

1. **Scalability**: Our approach can handle millions of tweets, essential for real-world social media analysis.
2. **Interpretability**: TF-IDF features allow us to understand which words are most influential in sentiment classification.
3. **Quick Iteration**: The efficiency of our pipeline allows for rapid experimentation and improvement.

## 🔮 Looking Ahead

- Experiment with advanced feature extraction techniques like word embeddings (Word2Vec, GloVe)
- Explore ensemble methods or deep learning models to potentially improve accuracy
- Implement cross-validation for more robust model evaluation
- Analyze feature importance to gain insights into sentiment drivers

This data preparation and modeling approach sets a solid foundation for our social media sentiment analysis project, combining efficiency with effectiveness to unlock valuable insights from vast amounts of text data.

In [37]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
def load_processed_data(file_path):
    if os.path.exists(file_path):
        print(f"Loading processed data from {file_path}")
        df = pd.read_csv(file_path)
        return df
    else:
        raise FileNotFoundError(f"Processed data file not found: {file_path}")

processed_file = 'stemmed_tweets.csv'
df = load_processed_data(processed_file)

print("\nSample of processed data:")
print(df[['text', 'stemmed_content']].head())

print(f"\nShape of the dataframe: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

null_count = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content: {null_count}")

if null_count > 0:
    print("\nSample of rows with null values in stemmed_content:")
    print(df[df['stemmed_content'].isnull()][['text', 'stemmed_content']].head())
    
    print("\nUnique values in 'text' column for rows with null stemmed_content:")
    print(df[df['stemmed_content'].isnull()]['text'].unique())

df['stemmed_content'] = df['stemmed_content'].fillna('')

null_count_after = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content after handling: {null_count_after}")

memory_usage = df.memory_usage(deep=True).sum() / 1024**2  # in MB
print(f"\nMemory usage of the dataframe: {memory_usage:.2f} MB")

X = df['stemmed_content']
y = df['target']  # Using 'target' as the label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1111, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
print("\nShape of vectorized training data:", X_train_vectorized.shape)
print("Shape of vectorized testing data:", X_test_vectorized.shape)

Loading processed data from stemmed_tweets.csv

Sample of processed data:
                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                     stemmed_content  
0  switchfoot http twitpic com zl awww bummer sho...  
1  upset updat facebook text might cri result sch...  
2  kenichan dive mani time ball manag save rest g...  
3                    whole bodi feel itchi like fire  
4                      nationwideclass behav mad see  

Shape of the dataframe: (1600000, 8)

Column names:
['target', 'ids', 'date', 'flag', 'user', 'text', 'sentiment_score', 'stemmed_content']

Number of null values in stemmed_content: 495

Sample of rows with null values in stemmed_content:
                  

In [38]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def load_processed_data(file_path):
    if os.path.exists(file_path):
        print(f"Loading processed data from {file_path}")
        df = pd.read_csv(file_path)
        return df
    else:
        raise FileNotFoundError(f"Processed data file not found: {file_path}")

processed_file = 'stemmed_tweets.csv'
df = load_processed_data(processed_file)

print("\nSample of processed data:")
print(df[['text', 'stemmed_content']].head())

print(f"\nShape of the dataframe: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())

null_count = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content: {null_count}")

if null_count > 0:
    print("\nSample of rows with null values in stemmed_content:")
    print(df[df['stemmed_content'].isnull()][['text', 'stemmed_content']].head())
    
    print("\nUnique values in 'text' column for rows with null stemmed_content:")
    print(df[df['stemmed_content'].isnull()]['text'].unique())

df['stemmed_content'] = df['stemmed_content'].fillna('')

null_count_after = df['stemmed_content'].isnull().sum()
print(f"\nNumber of null values in stemmed_content after handling: {null_count_after}")

memory_usage = df.memory_usage(deep=True).sum() / 1024**2  # in MB
print(f"\nMemory usage of the dataframe: {memory_usage:.2f} MB")

X = df[['stemmed_content', 'sentiment_score']]
y = df['target']  # Using 'target' as the label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1111, random_state=42)

# Initialize and fit the vectorizer for the text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train_text_vectorized = vectorizer.fit_transform(X_train['stemmed_content'])
X_test_text_vectorized = vectorizer.transform(X_test['stemmed_content'])

print("\nShape of vectorized training data:", X_train_text_vectorized.shape)
print("Shape of vectorized testing data:", X_test_text_vectorized.shape)

import scipy.sparse as sp

X_train_combined = sp.hstack((X_train_text_vectorized, X_train[['sentiment_score']].values))
X_test_combined = sp.hstack((X_test_text_vectorized, X_test[['sentiment_score']].values))

log = LogisticRegression()
log.fit(X_train_combined, y_train)

print("Logistic Regression model trained successfully.")


Loading processed data from stemmed_tweets.csv

Sample of processed data:
                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                     stemmed_content  
0  switchfoot http twitpic com zl awww bummer sho...  
1  upset updat facebook text might cri result sch...  
2  kenichan dive mani time ball manag save rest g...  
3                    whole bodi feel itchi like fire  
4                      nationwideclass behav mad see  

Shape of the dataframe: (1600000, 8)

Column names:
['target', 'ids', 'date', 'flag', 'user', 'text', 'sentiment_score', 'stemmed_content']

Number of null values in stemmed_content: 495

Sample of rows with null values in stemmed_content:
                  

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
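
The convergence warning above means the solver stopped at scikit-learn's default cap of 100 iterations. The following is a minimal sketch of the two remedies the warning itself suggests: raising `max_iter` and scaling the numeric column. The toy corpus, labels, and `max_iter=1000` are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for 'stemmed_content', 'sentiment_score' and 'target'.
texts = ["love great day", "hate bad day", "great happi fun",
         "sad bad cri", "love fun happi", "hate sad cri"]
scores = np.array([[0.8], [-0.7], [0.9], [-0.6], [0.7], [-0.8]])
labels = np.array([1, -1, 1, -1, 1, -1])

text_features = TfidfVectorizer().fit_transform(texts)
# Standardize the numeric column so it sits on a scale comparable to the TF-IDF values.
score_features = sp.csr_matrix(StandardScaler().fit_transform(scores))
features = sp.hstack((text_features, score_features)).tocsr()

# Raise the iteration cap above the default of 100.
model = LogisticRegression(max_iter=1000)
model.fit(features, labels)

print("Iterations used:", model.n_iter_)
print("Training accuracy:", model.score(features, labels))
```

If the warning persists on the full dataset, the scikit-learn documentation linked in the warning lists alternative solvers that may be worth trying.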


In [39]:
df

Unnamed: 0,target,ids,date,flag,user,text,sentiment_score,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",-0.0173,switchfoot http twitpic com zl awww bummer sho...
1,-1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,-0.7500,upset updat facebook text might cri result sch...
2,-1,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,0.4939,kenichan dive mani time ball manag save rest g...
3,-1,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,-0.2500,whole bodi feel itchi like fire
4,-1,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",-0.6597,nationwideclass behav mad see
...,...,...,...,...,...,...,...,...
1599995,1,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,0.5423,woke school best feel ever
1599996,1,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,0.4376,thewdb com cool hear old walt interview http b...
1599997,1,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,0.3612,readi mojo makeov ask detail
1599998,1,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,0.6784,happi th birthday boo alll time tupac amaru sh...


In [40]:
!pip install klib






In [41]:
import klib

In [42]:
klib.data_cleaning(df)

Shape of cleaned data: (1599693, 7) - Remaining NAs: 0


Dropped rows: 307
     of which 307 duplicates. (Rows (first 150 shown): [801280, 804316, 804656, 804929, 809639, 823566, 823595, 823619, 830626, 830972, 833961, 834508, 840266, 841026, 842068, 842479, 842826, 846865, 847570, 851371, 865056, 865716, 870364, 870367, 874372, 875794, 878756, 878853, 880198, 881514, 881861, 882949, 883653, 889465, 890486, 891462, 903500, 904559, 905072, 907905, 911334, 912551, 914199, 915818, 920187, 924751, 924878, 925756, 927257, 927648, 934067, 937070, 938462, 941125, 942203, 943043, 944326, 948418, 950858, 954263, 955475, 956465, 956776, 959366, 965343, 966748, 967536, 970029, 972173, 977743, 981152, 981548, 983781, 986244, 987152, 988945, 989829, 999959, 1006913, 1007046, 1008577, 1017840, 1018342, 1018882, 1019074, 1025446, 1027931, 1032965, 1034890, 1034955, 1041568, 1042877, 1045534, 1047232, 1057137, 1058328, 1060413, 1060468, 1060622, 1064346, 1067971, 1074156, 1080867, 1081552, 1084605, 10

Unnamed: 0,target,ids,date,user,text,sentiment_score,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",-0.0173,switchfoot http twitpic com zl awww bummer sho...
1,-1,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...,-0.7500,upset updat facebook text might cri result sch...
2,-1,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...,0.4939,kenichan dive mani time ball manag save rest g...
3,-1,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire,-0.2500,whole bodi feel itchi like fire
4,-1,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all....",-0.6597,nationwideclass behav mad see
...,...,...,...,...,...,...,...
1599688,1,2193601966,Tue Jun 16 08:40:49 PDT 2009,AmandaMarie1028,Just woke up. Having no school is the best fee...,0.5423,woke school best feel ever
1599689,1,2193601969,Tue Jun 16 08:40:49 PDT 2009,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,0.4376,thewdb com cool hear old walt interview http b...
1599690,1,2193601991,Tue Jun 16 08:40:49 PDT 2009,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,0.3612,readi mojo makeov ask detail
1599691,1,2193602064,Tue Jun 16 08:40:49 PDT 2009,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,0.6784,happi th birthday boo alll time tupac amaru sh...


In [43]:
vectorizer=TfidfVectorizer()

In [44]:
log=LogisticRegression()
log.fit(X_train_vectorized,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [45]:
predicted_x_train=log.predict(X_train_vectorized)

In [46]:
log.score(X_test_vectorized,y_test)

0.784775

### 🎯 Model Training and Evaluation
With the logistic regression baseline at 0.7848 test accuracy, we move to a stronger model. The **RandomForestClassifier** is our model of choice because it can capture non-linear interactions between features and usually performs reasonably well with little tuning.

**Steps Involved**:
1. **Data Loading**: We load the preprocessed data, ensuring that the file exists before proceeding.
   ```python
   df = load_processed_data('stemmed_tweets.csv')
   ```

2. **Model Training or Loading**: If a pre-trained model exists, we load it. Otherwise, we train a new model. This approach saves time and resources.
   ```python
   rf_model = improved_model(df)
   ```
   - **Feature Extraction**: TF-IDF vectorization (up to 20,000 unigram and bigram features) represents the text, while StandardScaler standardizes the numeric `sentiment_score` column.
   - **Model Selection**: RandomForestClassifier is used with 200 trees (`n_estimators=200`) and trained on a random sample of 400,000 rows.

3. **Hyperparameter Optimization**:
   ```python
   best_model = optimize_hyperparameters(df)
   ```
   - We fine-tune the pipeline with **RandomizedSearchCV**, which samples 20 parameter combinations and scores each with 3-fold cross-validation on the training split of a 200,000-row sample.

4. **Full Dataset Prediction**:
   ```python
   full_predictions = predict_full_dataset(df, best_model)
   ```
   - We predict on the entire dataset in chunks of 100,000 rows, so only one chunk's features are held in memory at a time. Because the full dataset includes the rows used for training, the resulting accuracy is not a held-out estimate.

### 💡 Key Insights
- **Efficient Handling**: Training on a sample, caching fitted models with joblib, and predicting in chunks keep the 1.6-million-row dataset manageable.
- **Effective Feature Engineering**: Combining TF-IDF text features with the scaled `sentiment_score` column gives the model both lexical features and a numeric sentiment signal.
- **Hyperparameter Tuning**: The tuned model reaches a mean cross-validation accuracy of 0.8356 and scores 0.8398 on its held-out test split.
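
One way to check how well a model generalizes is to score it only on rows it never saw during fitting. The sketch below uses synthetic data with pure-noise labels (an assumption made for illustration, not this project's data): a random forest can memorize its training rows, so accuracy on seen rows looks strong while accuracy on held-out rows stays near chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features with random labels: there is nothing real to learn.
X = rng.normal(size=(2000, 10))
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Scoring on rows seen during fitting rewards memorization.
seen_accuracy = accuracy_score(y_train, model.predict(X_train))
# Scoring on held-out rows measures generalization.
heldout_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Accuracy on training rows: {seen_accuracy:.4f}")
print(f"Accuracy on held-out rows: {heldout_accuracy:.4f}")
```

The same principle applies to any evaluation sample drawn from a dataframe that also supplied the training rows: the more the two overlap, the more optimistic the score.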

### 🚀 Why This Matters
This approach pairs a reusable scikit-learn pipeline with sampling, model caching, and chunked prediction, so the same code scales from quick experiments to the full 1.6 million tweets. The accuracy figures reported below should be read with their evaluation sets in mind: only scores computed on rows held out from training reflect how the model generalizes.


In [48]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import time
import joblib

def load_processed_data(file_path):
    if os.path.exists(file_path):
        print(f"Loading processed data from {file_path}")
        df = pd.read_csv(file_path)
        return df
    else:
        raise FileNotFoundError(f"Processed data file not found: {file_path}")

def improved_model(df, sample_size=400000, n_estimators=200, max_features=20000, model_path='sentiment_rf_model.joblib'):
    start_time = time.time()
    
    if os.path.exists(model_path):
        print(f"Loading existing model from {model_path}")
        pipeline = joblib.load(model_path)
    else:
        print("Training new model...")
        
        df_sample = df.sample(n=sample_size, random_state=42)
        
        X = df_sample[['stemmed_content', 'sentiment_score']]
        y = df_sample['target']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        preprocessor = ColumnTransformer(
            transformers=[
                ('text', TfidfVectorizer(max_features=max_features, ngram_range=(1, 2)), 'stemmed_content'),
                ('num', StandardScaler(), ['sentiment_score'])
            ])
        
        pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('classifier', RandomForestClassifier(n_estimators=n_estimators, n_jobs=-1, random_state=42))
        ])
       
        pipeline.fit(X_train, y_train)
       
        joblib.dump(pipeline, model_path)
        print(f"Model saved to {model_path}")
    
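    # Use a subset for quick evaluation.
    # NOTE: this subset is drawn from the full dataframe, so it can include rows the
    # model was trained on; the accuracy printed below is optimistic, not a held-out score.
    X_test = df[['stemmed_content', 'sentiment_score']].sample(n=100000, random_state=42)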
    y_test = df['target'].loc[X_test.index]
    
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    end_time = time.time()
    print(f"Runtime: {end_time - start_time:.2f} seconds")
    print(f"Accuracy: {accuracy:.4f}")
    
    return pipeline

processed_file = 'stemmed_tweets.csv'
df = load_processed_data(processed_file)

df['stemmed_content'] = df['stemmed_content'].fillna('')

print("Improved model with RandomForest:")
rf_model = improved_model(df)

Loading processed data from stemmed_tweets.csv
Improved model with RandomForest:
Training new model...
Model saved to sentiment_rf_model.joblib
Runtime: 4069.38 seconds
Accuracy: 0.9664


In [48]:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score
import time

def load_processed_data(file_path):
    print(f"Loading processed data from {file_path}")
    return pd.read_csv(file_path)

def load_and_evaluate_model(df, model_path='sentiment_rf_model.joblib', sample_size=100000):
    start_time = time.time()
    
    print(f"Loading existing model from {model_path}")
    pipeline = joblib.load(model_path)
    
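    # Sample a subset of data for quick evaluation.
    # NOTE: the subset comes from the full dataframe, so it can include rows the saved
    # model was trained on; treat the accuracy below as optimistic.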
    X_test = df[['stemmed_content', 'sentiment_score']].sample(n=sample_size, random_state=42)
    y_test = df['target'].loc[X_test.index]
    
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    end_time = time.time()
    print(f"Runtime: {end_time - start_time:.2f} seconds")
    print(f"Accuracy: {accuracy:.4f}")
    
    return pipeline


processed_file = 'stemmed_tweets.csv'
df = load_processed_data(processed_file)

df['stemmed_content'] = df['stemmed_content'].fillna('')

print("Loading and evaluating the saved RandomForest model:")
rf_model = load_and_evaluate_model(df)

Loading processed data from stemmed_tweets.csv
Loading and evaluating the saved RandomForest model:
Loading existing model from sentiment_rf_model.joblib
Runtime: 56.27 seconds
Accuracy: 0.9664


In [49]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
import time
import joblib
import os

def optimize_hyperparameters(df, sample_size=200000, model_path='best_rf_model.joblib'):
    if os.path.exists(model_path):
        print(f"Loading best model from {model_path}")
        best_model = joblib.load(model_path)
        
        # Draw the sample once and take features and labels from the same rows
        sample = df.sample(n=sample_size, random_state=42)
        X = sample[['stemmed_content', 'sentiment_score']]
        y = sample['target']
        _, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        y_pred = best_model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Loaded model test set accuracy: {accuracy:.4f}")
        
        return best_model
    
    print("No saved model found. Running hyperparameter optimization...")
    # Draw the sample once and take features and labels from the same rows
    sample = df.sample(n=sample_size, random_state=42)
    X = sample[['stemmed_content', 'sentiment_score']]
    y = sample['target']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('text', TfidfVectorizer(), 'stemmed_content'),
            ('num', StandardScaler(), ['sentiment_score'])
        ])
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])
    
    param_dist = {
        'preprocessor__text__max_features': [10000, 20000, 30000],
        'preprocessor__text__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [10, 20, 30, None],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    }
    
    random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=20, cv=3, n_jobs=-1, random_state=42)
    
    start_time = time.time()
    random_search.fit(X_train, y_train)
    end_time = time.time()
    
    print(f"Hyperparameter optimization runtime: {end_time - start_time:.2f} seconds")
    print(f"Best parameters: {random_search.best_params_}")
    print(f"Best cross-validation score: {random_search.best_score_:.4f}")
    
    y_pred = random_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test set accuracy: {accuracy:.4f}")
    
    joblib.dump(random_search.best_estimator_, model_path)
    print(f"Best model saved to {model_path}")
    
    return random_search.best_estimator_

print("Optimizing hyperparameters or loading best model:")
best_model = optimize_hyperparameters(df)

Optimizing hyperparameters or loading best model:
No saved model found. Running hyperparameter optimization...
Hyperparameter optimization runtime: 8304.08 seconds
Best parameters: {'preprocessor__text__ngram_range': (1, 3), 'preprocessor__text__max_features': 10000, 'classifier__n_estimators': 200, 'classifier__min_samples_split': 5, 'classifier__min_samples_leaf': 1, 'classifier__max_depth': None}
Best cross-validation score: 0.8356
Test set accuracy: 0.8398
Best model saved to best_rf_model.joblib


In [50]:
import pandas as pd
from sklearn.metrics import accuracy_score
import time
import joblib
import os
import numpy as np

def predict_full_dataset(df, best_model, predictions_path='full_predictions.joblib', chunk_size=100000):
    if os.path.exists(predictions_path):
        print(f"Loading saved predictions from {predictions_path}")
        full_predictions = joblib.load(predictions_path)
        
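        # Saved predictions may be partial if an earlier run hit the time limit,
        # so compare only against the rows that were actually predicted.
        accuracy = accuracy_score(df['target'][:len(full_predictions)], full_predictions)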
        print(f"Loaded predictions accuracy: {accuracy:.4f}")
        print(f"Total number of predictions: {len(full_predictions)}")
        
        return full_predictions
    
    print("No saved predictions found. Running prediction on full dataset...")
    start_time = time.time()
    
    full_predictions = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        chunk_pred = best_model.predict(chunk[['stemmed_content', 'sentiment_score']])
        full_predictions.extend(chunk_pred)
        
        if (i + chunk_size) % (chunk_size * 10) == 0:
            joblib.dump(full_predictions, f'intermediate_predictions_{i+chunk_size}.joblib')
        
        print(f"Processed {i+len(chunk_pred)} out of {len(df)} rows")
        
        if time.time() - start_time > 3500:  # 3500 seconds = 58 minutes
            print("Approaching 1 hour limit. Saving current progress.")
            break
    
    full_predictions = np.array(full_predictions)
    
    end_time = time.time()
    print(f"Full dataset prediction runtime: {end_time - start_time:.2f} seconds")
    print(f"Total number of predictions: {len(full_predictions)}")
    
    accuracy = accuracy_score(df['target'][:len(full_predictions)], full_predictions)
    print(f"Accuracy on the processed part: {accuracy:.4f}")
    
    joblib.dump(full_predictions, predictions_path)
    print(f"Predictions saved to {predictions_path}")
    
    return full_predictions

full_predictions = predict_full_dataset(df, best_model)

No saved predictions found. Running prediction on full dataset...
Processed 100000 out of 1600000 rows
Processed 200000 out of 1600000 rows
Processed 300000 out of 1600000 rows
Processed 400000 out of 1600000 rows
Processed 500000 out of 1600000 rows
Processed 600000 out of 1600000 rows
Processed 700000 out of 1600000 rows
Processed 800000 out of 1600000 rows
Processed 900000 out of 1600000 rows
Processed 1000000 out of 1600000 rows
Processed 1100000 out of 1600000 rows
Processed 1200000 out of 1600000 rows
Processed 1300000 out of 1600000 rows
Processed 1400000 out of 1600000 rows
Processed 1500000 out of 1600000 rows
Processed 1600000 out of 1600000 rows
Full dataset prediction runtime: 655.45 seconds
Total number of predictions: 1600000
Accuracy on the processed part: 0.8548
Predictions saved to full_predictions.joblib


### 🎯 Project Overview: Twitter Sentiment Analysis
In this project, we analyzed the sentiment of Twitter data using NLTK preprocessing, TF-IDF features, and scikit-learn classifiers. The goal was to build a pipeline that could handle a large dataset, classify sentiment accurately, and produce predictions for every tweet.

### 💡 Key Contributions and Insights
- **Data Preprocessing Excellence**: 
  - **Stemming**: Leveraged the Porter Stemmer to reduce words to their root forms, effectively minimizing the vocabulary size while retaining the essence of the content.
  - **Stopwords Removal**: Eliminated non-essential words using NLTK's stopwords list, ensuring the focus remained on the most impactful terms.
  - **Handling Null Values**: Found 495 null values in the `stemmed_content` field and replaced them with empty strings so that vectorization would not fail on missing text.
  
- **Efficient Data Management**: 
  - Processed and vectorized all 1.6 million tweets, relying on sparse TF-IDF matrices, sampling, and chunked prediction to keep memory use manageable.

- **Advanced Feature Extraction**:
  - **TF-IDF Vectorization**: Converted the text to TF-IDF features. For the logistic regression baseline, `max_features=5000` keeps the 5,000 most frequent terms in the corpus, which is a frequency cutoff rather than a measure of how strongly a term signals sentiment. The random forest pipelines use larger vocabularies that include n-grams.
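
As a small illustration of how `max_features` caps the TF-IDF vocabulary, scikit-learn keeps the terms with the highest total count across the corpus. The sketch below uses a hypothetical four-tweet corpus and `max_features=2`; neither comes from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stemmed tweets; 'day' and 'good' are the most frequent terms.
corpus = [
    "good day good mood",
    "bad day",
    "good day sunni",
    "terribl day",
]

# Keep only the 2 terms with the highest total count across the corpus.
vectorizer = TfidfVectorizer(max_features=2)
matrix = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # terms that survived the cap
print(matrix.shape)                    # (number of documents, number of kept terms)
```

Rare terms such as `terribl` are dropped even though they carry sentiment, which is one reason to treat the cap as a tunable parameter.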

### 🚀 Performance Highlights
- **Model Training and Validation**: 
  - The data was split into training, validation, and test sets (roughly 80/10/10) so that performance could be measured on tweets the model had not seen.
  - **Vectorization**: The vectorized training matrix has one row per training tweet and 5,000 TF-IDF columns, the cap set by `max_features=5000`.

- **Impactful Outcomes**:
  - **Memory Reporting**: The notebook measures the dataframe's in-memory size with `df.memory_usage(deep=True)` and prints it in MB.
  - **Prediction Accuracy**: The logistic regression baseline scores 0.7848 on its test set. The tuned random forest scores 0.8398 on its held-out test split and 0.8548 over the full dataset, which includes rows it was trained on.
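
The three-way split can be sketched as two successive calls to `train_test_split`, using the same `test_size` values as the notebook (0.1, then 0.1111 of the remainder). The 10,000-row index array is a hypothetical stand-in for the dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataframe rows: 10,000 row indices.
rows = np.arange(10_000)

# First hold out 10% of all rows as the test set.
train_val, test = train_test_split(rows, test_size=0.1, random_state=42)
# Then hold out 0.1111 of the remaining 90%, about 10% of all rows, for validation.
train, val = train_test_split(train_val, test_size=0.1111, random_state=42)

print(len(train), len(val), len(test))
```

The second fraction is 0.1111 rather than 0.1 because it applies to the 90% that remains after the first split: 0.1111 × 0.9 ≈ 0.1 of the original rows.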

### 🏆 Why This Matters
This project walks through a complete sentiment analysis workflow on 1.6 million tweets: preprocessing, TF-IDF feature extraction, a logistic regression baseline, and a tuned random forest. The resulting pipeline is reusable, cached on disk, and able to score the full dataset in roughly 11 minutes (655 seconds).
