# Dharani Nareddy

# Campus ID: LN66171

## Introduction

This sentiment analysis project works with a dataset of reviews and their sentiment labels. The dataset is divided into a training set (40,000 reviews) and a production set (10,000 reviews).

In the training set, each review carries a binary sentiment label, 0 or 1. The training set is used to fit the sentiment analysis model, which learns to associate review text with a sentiment label.

The production set stands in for real-world data the model has not seen. Its labels are stored in a separate file and are used only for evaluation: the trained model predicts the sentiment of each production review, and those predictions are compared with the held-out labels to measure how well the model generalizes.

By training on the training set and evaluating on the production set, this project aims to build a sentiment classifier that labels unseen reviews accurately.

### Importing libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.sentiment import SentimentIntensityAnalyzer

### Loading the datasets

In [2]:
# Load the training dataset
train_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_train.csv'
train_data = pd.read_csv(train_data_url)

# Load the production dataset (features)
prod_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_final.csv'
prod_data = pd.read_csv(prod_data_url)

# Load the production dataset (labels)
prod_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_final.csv'
prod_labels = pd.read_csv(prod_labels_url)

# Load the labels for the training dataset
train_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_train.csv'
train_labels = pd.read_csv(train_labels_url)

# Display the first few rows of the datasets
print("Training dataset:")
print(train_data.head())
print("\nProduction dataset (features):")
print(prod_data.head())
print("\nProduction dataset (labels):")
print(prod_labels.head())
print("\nLabels for training dataset:")
print(train_labels.head())


Training dataset:
                                              review
0  Shame, is a Swedish film in Swedish with Engli...
1  I know it's rather unfair to comment on a movi...
2  "Bread" very sharply skewers the conventions o...
3  After reading tons of good reviews about this ...
4  During the Civil war a wounded union soldier h...

Production dataset (features):
                                              review
0  I first saw Heimat 2 on BBC2 in the 90's when ...
1  I sat down to watch "Midnight Cowboy" thinking...
2  I can never fathom why people take time to rev...
3  With that line starts one silly, boring Britis...
4  Here's the spoiler: At the end of the movie, a...

Production dataset (labels):
   sentiment
0          1
1          1
2          1
3          0
4          0

Labels for training dataset:
   sentiment
0          1
1          0
2          1
3          1
4          1


### Exploring the dataset


In [4]:
print("Training dataset:")
print(train_data.describe(include='all'))

print("\nProduction dataset (features):")
print(prod_data.describe(include='all'))

print("\nProduction dataset (labels):")
print(prod_labels.describe(include='all'))

print("\nLabels for training dataset:")
print(train_labels.describe(include='all'))

Training dataset:
                                                   review
count                                               40000
unique                                              39719
top     Loved today's show!!! It was a variety and not...
freq                                                    5

Production dataset (features):
                                                   review
count                                               10000
unique                                               9989
top     The Cat in the Hat is just a slap in the face ...
freq                                                    2

Production dataset (labels):
          sentiment
count  10000.000000
mean       0.500000
std        0.500025
min        0.000000
25%        0.000000
50%        0.500000
75%        1.000000
max        1.000000

Labels for training dataset:
          sentiment
count  40000.000000
mean       0.500000
std        0.500006
min        0.000000
25%        0.000000
50%       

In [5]:

# Display information about the datasets
print("Training dataset:")
print(train_data.info())

print("\nProduction dataset (features):")
print(prod_data.info())

print("\nProduction dataset (labels):")
print(prod_labels.info())

print("\nLabels for training dataset:")
print(train_labels.info())


Training dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  40000 non-null  object
dtypes: object(1)
memory usage: 312.6+ KB
None

Production dataset (features):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  10000 non-null  object
dtypes: object(1)
memory usage: 78.2+ KB
None

Production dataset (labels):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   sentiment  10000 non-null  int64
dtypes: int64(1)
memory usage: 78.2 KB
None

Labels for training dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 1 col

In [6]:

# Check whether each dataset is empty (DataFrame.empty is True when it has no rows or no columns)
print("Training dataset:")
if train_data.empty:
    print("No columns in the training dataset.")
else:
    print("Columns present in the training dataset.")

print("\nProduction dataset (features):")
if prod_data.empty:
    print("No columns in the production dataset (features).")
else:
    print("Columns present in the production dataset (features).")

print("\nProduction dataset (labels):")
if prod_labels.empty:
    print("No columns in the production dataset (labels).")
else:
    print("Columns present in the production dataset (labels).")

print("\nLabels for training dataset:")
if train_labels.empty:
    print("No columns in the labels for training dataset.")
else:
    print("Columns present in the labels for training dataset.")


Training dataset:
Columns present in the training dataset.

Production dataset (features):
Columns present in the production dataset (features).

Production dataset (labels):
Columns present in the production dataset (labels).

Labels for training dataset:
Columns present in the labels for training dataset.


In [7]:
# Retrieve the column names
print("Training dataset columns:")
train_columns = train_data.columns
print(train_columns)

print("\nProduction dataset (features) columns:")
prod_columns = prod_data.columns
print(prod_columns)

print("\nProduction dataset (labels) columns:")
prod_labels_columns = prod_labels.columns
print(prod_labels_columns)

print("\nLabels for training dataset columns:")
train_labels_columns = train_labels.columns
print(train_labels_columns)


Training dataset columns:
Index(['review'], dtype='object')

Production dataset (features) columns:
Index(['review'], dtype='object')

Production dataset (labels) columns:
Index(['sentiment'], dtype='object')

Labels for training dataset columns:
Index(['sentiment'], dtype='object')


In [8]:

# Get the shape of the datasets
print("Training dataset shape:", train_data.shape)
print("Production dataset (features) shape:", prod_data.shape)
print("Production dataset (labels) shape:", prod_labels.shape)
print("Labels for training dataset shape:", train_labels.shape)


Training dataset shape: (40000, 1)
Production dataset (features) shape: (10000, 1)
Production dataset (labels) shape: (10000, 1)
Labels for training dataset shape: (40000, 1)


In [9]:

# Get the number of rows in the datasets
print("Training dataset rows:", len(train_data))
print("Production dataset (features) rows:", len(prod_data))
print("Production dataset (labels) rows:", len(prod_labels))
print("Labels for training dataset rows:", len(train_labels))


Training dataset rows: 40000
Production dataset (features) rows: 10000
Production dataset (labels) rows: 10000
Labels for training dataset rows: 40000
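The row counts above show that features and labels have matching lengths, and the label means of 0.5 suggest balanced classes. A minimal sketch of how both checks could be bundled into one helper is given below. It uses only the standard library; `summarize_labels` and the toy lists are hypothetical stand-ins for `train_data['review']` and `train_labels['sentiment']`, not part of the notebook.

```python
from collections import Counter


def summarize_labels(reviews, labels):
    """Check that features and labels align, then return the share of each class."""
    if len(reviews) != len(labels):
        raise ValueError(f"{len(reviews)} reviews but {len(labels)} labels")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Toy stand-ins for the review texts and their 0/1 labels
toy_reviews = ["great film", "dull plot", "loved it", "waste of time"]
toy_labels = [1, 0, 1, 0]
print(summarize_labels(toy_reviews, toy_labels))
```

A share close to 0.5 per class means plain accuracy is a reasonable first metric for this problem.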


In [68]:
# Sanity Check 1: Sample Training Data
print("Sample Training Data:")
print(train_data.head())

# Sanity Check 2: Sample Production Data (Features)
print("Sample Production Data (Features):")
print(prod_data.head())

# Sanity Check 3: Sample Production Data (Labels)
print("Sample Production Data (Labels):")
print(prod_labels.head())

# Sanity Check 4: Sample Labels for Training Data
print("Sample Labels for Training Data:")
print(train_labels.head())

# Sanity Check 5: Data Shape
print("Training Data Shape:", train_data.shape)
print("Production Data (Features) Shape:", prod_data.shape)
print("Production Data (Labels) Shape:", prod_labels.shape)
print("Labels for Training Data Shape:", train_labels.shape)


Sample Training Data:
                                              review
0  Shame, is a Swedish film in Swedish with Engli...
1  I know it's rather unfair to comment on a movi...
2  "Bread" very sharply skewers the conventions o...
3  After reading tons of good reviews about this ...
4  During the Civil war a wounded union soldier h...
Sample Production Data (Features):
                                              review
0  I first saw Heimat 2 on BBC2 in the 90's when ...
1  I sat down to watch "Midnight Cowboy" thinking...
2  I can never fathom why people take time to rev...
3  With that line starts one silly, boring Britis...
4  Here's the spoiler: At the end of the movie, a...
Sample Production Data (Labels):
   sentiment
0          1
1          1
2          1
3          0
4          0
Sample Labels for Training Data:
   sentiment
0          1
1          0
2          1
3          1
4          1
Training Data Shape: (40000, 1)
Production Data (Features) Shape: (10000, 1)
Producti

### Data cleaning

In [10]:

# Count the null values in each column of train_data
train_data_null_counts = train_data.isnull().sum()

# Count the null values in each column of prod_data
prod_data_null_counts = prod_data.isnull().sum()

# Count the null values in each column of prod_labels
prod_labels_null_counts = prod_labels.isnull().sum()

# Print the null value counts for train_data
print("Null value counts for train_data:")
print(train_data_null_counts)

# Print the null value counts for prod_data
print("\nNull value counts for prod_data:")
print(prod_data_null_counts)

# Print the null value counts for prod_labels
print("\nNull value counts for prod_labels:")
print(prod_labels_null_counts)

Null value counts for train_data:
review    0
dtype: int64

Null value counts for prod_data:
review    0
dtype: int64

Null value counts for prod_labels:
sentiment    0
dtype: int64
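The null check finds no missing values, but the earlier `describe()` output reports 39,719 unique texts among 40,000 training reviews, so 281 rows repeat an earlier review. Whether to drop them is a judgment call. The sketch below shows one way to keep the first occurrence of each text together with its label. `drop_duplicate_reviews` and the toy data are hypothetical and are not used elsewhere in the notebook.

```python
def drop_duplicate_reviews(reviews, labels):
    """Keep the first occurrence of each review text, along with its label."""
    seen = set()
    kept_reviews, kept_labels = [], []
    for review, label in zip(reviews, labels):
        if review in seen:
            continue  # an identical text was already kept
        seen.add(review)
        kept_reviews.append(review)
        kept_labels.append(label)
    return kept_reviews, kept_labels


# Toy stand-ins for the review texts and their labels
toy_reviews = ["loved it", "dull plot", "loved it", "fine", "loved it"]
toy_labels = [1, 0, 1, 1, 1]
unique_reviews, unique_labels = drop_duplicate_reviews(toy_reviews, toy_labels)
print(len(toy_reviews) - len(unique_reviews), "duplicates removed")
```

Filtering features and labels in the same pass keeps the two aligned, which matters here because they are loaded from separate files.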


### Text preprocessing

In [11]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.sentiment import SentimentIntensityAnalyzer

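# Download the required NLTK resources
nltk.download('vader_lexicon')  # lexicon used by SentimentIntensityAnalyzer
nltk.download('punkt')          # tokenizer models used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords')      # stopword lists used in a later cell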

# Load the training dataset
train_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_train.csv'
train_data = pd.read_csv(train_data_url)

# Perform text cleaning
train_data['clean_review'] = train_data['review'].str.lower()  # Convert to lowercase
# Pass regex=True explicitly: recent pandas versions treat the pattern as a literal string by default
train_data['clean_review'] = train_data['clean_review'].str.replace(r'\d+', '', regex=True)  # Remove numbers
train_data['clean_review'] = train_data['clean_review'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation

# Perform tokenization
train_data['tokens'] = train_data['clean_review'].apply(word_tokenize)  # Tokenize each review

# Perform stemming
stemmer = PorterStemmer()
train_data['stemmed_tokens'] = train_data['tokens'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])  # Apply stemming

# Perform sentiment analysis
sentiment_analyzer = SentimentIntensityAnalyzer()
train_data['sentiment_score'] = train_data['clean_review'].apply(lambda text: sentiment_analyzer.polarity_scores(text)['compound'])  # Calculate sentiment score

# Print the modified dataset
print(train_data.head())


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/dharanireddy/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


                                              review  \
0  Shame, is a Swedish film in Swedish with Engli...   
1  I know it's rather unfair to comment on a movi...   
2  "Bread" very sharply skewers the conventions o...   
3  After reading tons of good reviews about this ...   
4  During the Civil war a wounded union soldier h...   

                                        clean_review  \
0  shame is a swedish film in swedish with englis...   
1  i know its rather unfair to comment on a movie...   
2  bread very sharply skewers the conventions of ...   
3  after reading tons of good reviews about this ...   
4  during the civil war a wounded union soldier h...   

                                              tokens  \
0  [shame, is, a, swedish, film, in, swedish, wit...   
1  [i, know, its, rather, unfair, to, comment, on...   
2  [bread, very, sharply, skewers, the, conventio...   
3  [after, reading, tons, of, good, reviews, abou...   
4  [during, the, civil, war, a, wounded, union

In [3]:
# Load the production dataset (features)
prod_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_final.csv'
prod_data = pd.read_csv(prod_data_url)

# Perform text cleaning
prod_data['clean_review'] = prod_data['review'].str.lower()  # Convert to lowercase
# Pass regex=True explicitly: recent pandas versions treat the pattern as a literal string by default
prod_data['clean_review'] = prod_data['clean_review'].str.replace(r'\d+', '', regex=True)  # Remove numbers
prod_data['clean_review'] = prod_data['clean_review'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation

# Perform tokenization
prod_data['tokens'] = prod_data['clean_review'].apply(word_tokenize)  # Tokenize each review

# Perform stemming
stemmer = PorterStemmer()
prod_data['stemmed_tokens'] = prod_data['tokens'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])  # Apply stemming

# Perform sentiment analysis
sentiment_analyzer = SentimentIntensityAnalyzer()
prod_data['sentiment_score'] = prod_data['clean_review'].apply(lambda text: sentiment_analyzer.polarity_scores(text)['compound'])  # Calculate sentiment score

# Display the modified production dataset
print(prod_data.head())




                                              review  \
0  I first saw Heimat 2 on BBC2 in the 90's when ...   
1  I sat down to watch "Midnight Cowboy" thinking...   
2  I can never fathom why people take time to rev...   
3  With that line starts one silly, boring Britis...   
4  Here's the spoiler: At the end of the movie, a...   

                                        clean_review  \
0  i first saw heimat  on bbc in the s when i was...   
1  i sat down to watch midnight cowboy thinking i...   
2  i can never fathom why people take time to rev...   
3  with that line starts one silly boring british...   
4  heres the spoiler at the end of the movie a li...   

                                              tokens  \
0  [i, first, saw, heimat, on, bbc, in, the, s, w...   
1  [i, sat, down, to, watch, midnight, cowboy, th...   
2  [i, can, never, fathom, why, people, take, tim...   
3  [with, that, line, starts, one, silly, boring,...   
4  [heres, the, spoiler, at, the, end, of, the
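The three cleaning steps used in the cells above (lowercasing, removing digits, removing punctuation) can be reproduced on a single string with the standard `re` module. This is only an illustrative sketch: `basic_clean` is a hypothetical helper, and the sample is the visible start of the first production review.

```python
import re


def basic_clean(text):
    """Lowercase, then strip digits and punctuation, mirroring the three steps above."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove runs of digits
    text = re.sub(r"[^\w\s]", "", text)  # remove anything that is not a word character or whitespace
    return text


sample = "I first saw Heimat 2 on BBC2 in the 90's"
print(basic_clean(sample))  # i first saw heimat  on bbc in the s
```

Removing digits leaves fragments behind: `90's` becomes a stray `s`, and `Heimat 2` leaves a double space. The same effect is visible in the `clean_review` column above.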

In [13]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Load the train data
train_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_train.csv'
train_data = pd.read_csv(train_data_url)

# Build the lookup sets once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCTUATION = set(string.punctuation)

# Define a function to perform the cleaning process
def clean_text(text):
    # Tokenize the text into words and convert the tokens to lowercase
    tokens = [token.lower() for token in word_tokenize(text)]

    # Remove stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Drop any token that contains a digit or a punctuation character
    # (this also discards pure-punctuation tokens and contractions such as "n't")
    tokens = [token for token in tokens
              if not any(c.isdigit() or c in PUNCTUATION for c in token)]

    # Join the tokens back into a single string
    return ' '.join(tokens)

# Apply the cleaning function to the 'review' column
train_data['cleaned_review'] = train_data['review'].apply(clean_text)

# Display the cleaned data
print(train_data['cleaned_review'])


0        shame swedish film swedish english subtitles f...
1        know rather unfair comment movie without seein...
2        bread sharply skewers conventions horror movie...
3        reading tons good reviews movie decided take s...
4        civil war wounded union soldier hides isolated...
                               ...                        
39995    pagan must say movie little magickal significa...
39996    lot comments seem treat film baseball movie fe...
39997    seen series since leave tv background noise br...
39998    dollars wedding ring scene riot also guffawed ...
39999    king kong stripped top remake breathless know ...
Name: cleaned_review, Length: 40000, dtype: object


In [14]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Compute the sentiment scores for each cleaned review
sentiment_scores = train_data['cleaned_review'].apply(lambda x: sia.polarity_scores(x))

# Print the sentiment scores
for i, score in enumerate(sentiment_scores):
    print(f"Review {i+1} Sentiment Score: {score}")


Review 1 Sentiment Score: {'neg': 0.145, 'neu': 0.45, 'pos': 0.405, 'compound': 0.997}
Review 2 Sentiment Score: {'neg': 0.228, 'neu': 0.461, 'pos': 0.311, 'compound': 0.8024}
Review 3 Sentiment Score: {'neg': 0.228, 'neu': 0.592, 'pos': 0.181, 'compound': -0.6745}
Review 4 Sentiment Score: {'neg': 0.154, 'neu': 0.503, 'pos': 0.343, 'compound': 0.9783}
Review 5 Sentiment Score: {'neg': 0.151, 'neu': 0.618, 'pos': 0.232, 'compound': 0.9787}
Review 6 Sentiment Score: {'neg': 0.189, 'neu': 0.553, 'pos': 0.258, 'compound': 0.7003}
Review 7 Sentiment Score: {'neg': 0.049, 'neu': 0.717, 'pos': 0.234, 'compound': 0.9863}
Review 8 Sentiment Score: {'neg': 0.215, 'neu': 0.624, 'pos': 0.161, 'compound': -0.6518}
Review 9 Sentiment Score: {'neg': 0.015, 'neu': 0.524, 'pos': 0.461, 'compound': 0.9938}
Review 10 Sentiment Score: {'neg': 0.143, 'neu': 0.717, 'pos': 0.139, 'compound': -0.0754}
Review 11 Sentiment Score: {'neg': 0.107, 'neu': 0.62, 'pos': 0.273, 'compound': 0.9916}
Review 12 Sentiment
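VADER returns a continuous `compound` score in [-1, 1], whereas the training labels are 0 or 1. One simple way to compare the two is to threshold the compound score. The sketch below does that for the first five reviews, using the scores printed above and the first five values of `train_labels`. The threshold of 0.0 and the helper `compound_to_label` are assumptions made for illustration, not choices made in the notebook.

```python
def compound_to_label(compound, threshold=0.0):
    """Map a VADER compound score in [-1, 1] to a binary label (1 = positive)."""
    return 1 if compound > threshold else 0


# Compound scores of the first five reviews, copied from the output above
compounds = [0.997, 0.8024, -0.6745, 0.9783, 0.9787]
# First five training labels, copied from train_labels.head()
labels = [1, 0, 1, 1, 1]

predictions = [compound_to_label(c) for c in compounds]
agreement = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, agreement)  # [1, 1, 0, 1, 1] 0.6
```

On these five rows the lexicon-based score disagrees with the label twice. Five rows are far too few to draw conclusions, but they show why the notebook goes on to train a supervised model.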

### Feature Extraction

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the cleaned reviews into TF-IDF vectors
X_train_vectorized = vectorizer.fit_transform(train_data['cleaned_review'])

# Print the shape of the vectorized data
print('Training data (vectorized):')
print(X_train_vectorized.shape)


Training data (vectorized):
(40000, 88246)
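Each of the 88,246 columns corresponds to one vocabulary term, and each cell holds that term's TF-IDF weight in a review. The sketch below computes such weights by hand on two toy documents, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` applies with its default settings. It splits on whitespace rather than using scikit-learn's tokenizer, and `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in raw])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["good movie", "bad movie"])
print(vocab)
print([round(v, 4) for v in vectors[0]])
```

The word shared by both documents ("movie") gets a lower weight than the word that distinguishes them ("good"), which is the point of the IDF term.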


In [16]:
# Load the production dataset
prod_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_final.csv'
prod_data = pd.read_csv(prod_data_url)

# Perform cleaning on the production dataset
prod_data['cleaned_review'] = prod_data['review'].apply(clean_text)

# Print the cleaned production dataset
print('Cleaned production dataset:')
print(prod_data['cleaned_review'])


Cleaned production dataset:
0       first saw heimat art college living moving amo...
1       sat watch midnight cowboy thinking would anoth...
2       never fathom people take time review movies un...
3       line starts one silly boring british sci fi fi...
4       spoiler end movie little piece dies spend rest...
                              ...                        
9995    protocol picture starring goldie hawn bubbly c...
9996    vein natural born killers another movie popula...
9997    sadly inferior precursor afraid virginia woolf...
9998    real hoot unintentionally sidney portier chara...
9999    deal clothes dressed like something late early...
Name: cleaned_review, Length: 10000, dtype: object


In [17]:
# Transform the cleaned production reviews into TF-IDF vectors
X_prod_vectorized = vectorizer.transform(prod_data['cleaned_review'])

# Print the shape of the vectorized data
print('Production data (vectorized):')
print(X_prod_vectorized.shape)


Production data (vectorized):
(10000, 88246)
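The production matrix has the same 88,246 columns as the training matrix because the vectorizer is fitted on the training reviews only and then reused. Words that appear only in production reviews have no column and are ignored. The sketch below illustrates that fit-then-transform pattern with plain term counts. `fit_vocabulary` and `transform_counts` are hypothetical helpers, not scikit-learn functions.

```python
def fit_vocabulary(docs):
    """Assign a column index to every distinct whitespace-separated token."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}


def transform_counts(docs, vocabulary):
    """Count known tokens per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            idx = vocabulary.get(tok)
            if idx is not None:
                row[idx] += 1
        rows.append(row)
    return rows


vocabulary = fit_vocabulary(["good movie", "bad movie"])   # "training" documents
prod_rows = transform_counts(["good good acting"], vocabulary)  # "production" document
print(vocabulary)
print(prod_rows)  # [[0, 2, 0]]
```

Fitting on the training data alone keeps information from the production set out of the features, so the production evaluation stays honest.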


### Data Splitting

In [18]:
from sklearn.model_selection import train_test_split

# Load the training labels
train_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_train.csv'
train_labels = pd.read_csv(train_labels_url)

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_vectorized, train_labels['sentiment'], test_size=0.2, random_state=42)

# Print the shapes of the split datasets
print('Training data:')
print(X_train.shape)
print(y_train.shape)
print('Validation data:')
print(X_val.shape)
print(y_val.shape)

# Load the production labels
prod_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_final.csv'
prod_labels = pd.read_csv(prod_labels_url)

# Split the production dataset (optional)
X_prod_train, X_prod_val, y_prod_train, y_prod_val = train_test_split(X_prod_vectorized, prod_labels['sentiment'], test_size=0.2, random_state=42)

# Print the shapes of the split production datasets
print('Production training data:')
print(X_prod_train.shape)
print(y_prod_train.shape)
print('Production validation data:')
print(X_prod_val.shape)
print(y_prod_val.shape)


Training data:
(32000, 88246)
(32000,)
Validation data:
(8000, 88246)
(8000,)
Production training data:
(8000, 88246)
(8000,)
Production validation data:
(2000, 88246)
(2000,)
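The 32,000 / 8,000 shapes follow from `test_size=0.2` applied to 40,000 rows. A rough sketch of the underlying idea, a seeded shuffle followed by a cut, is shown below. It does not reproduce the exact permutation `train_test_split` generates, and `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a held-out share."""
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = split_indices(40000)
print(len(train_idx), len(val_idx))
```

Using the same index lists for features and labels keeps each review paired with its label, which is what passing both arrays to a single `train_test_split` call achieves.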


In [19]:
from sklearn.model_selection import train_test_split

# Load the training dataset
train_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_train.csv'
train_data = pd.read_csv(train_data_url)

# Load the training labels
train_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_train.csv'
train_labels = pd.read_csv(train_labels_url)

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['review'], train_labels['sentiment'], test_size=0.2, random_state=42)

# Print the shapes of the split datasets
print('Training data:')
print(X_train.shape)
print(y_train.shape)
print('Validation data:')
print(X_val.shape)
print(y_val.shape)

# Load the production dataset
prod_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_final.csv'
prod_data = pd.read_csv(prod_data_url)

# Load the production labels
prod_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_final.csv'
prod_labels = pd.read_csv(prod_labels_url)

# Split the production dataset (optional)
X_prod_train, X_prod_val, y_prod_train, y_prod_val = train_test_split(prod_data['review'], prod_labels['sentiment'], test_size=0.2, random_state=42)

# Print the shapes of the split production datasets
print('Production training data:')
print(X_prod_train.shape)
print(y_prod_train.shape)
print('Production validation data:')
print(X_prod_val.shape)
print(y_prod_val.shape)


Training data:
(32000,)
(32000,)
Validation data:
(8000,)
(8000,)
Production training data:
(8000,)
(8000,)
Production validation data:
(2000,)
(2000,)


In [20]:
# Load the labels dataset
train_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_train.csv'
train_labels = pd.read_csv(train_labels_url)

# The labels are already numeric (0 or 1), so no string-to-number conversion is needed.
# Mapping them through {'negative': 0, 'positive': 1} would turn every value into NaN.
y_train_numeric = train_labels['sentiment']

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_vectorized, y_train_numeric, test_size=0.2, random_state=42)

# Print the shapes of the split datasets
print('Training data:')
print(X_train.shape)
print(y_train.shape)
print('Validation data:')
print(X_val.shape)
print(y_val.shape)


Training data:
(32000, 88246)
(32000,)
Validation data:
(8000, 88246)
(8000,)


### Selecting and training a model

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the training dataset
train_data_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/X_train.csv'
train_data = pd.read_csv(train_data_url)

# Load the training labels
train_labels_url = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/final/y_train.csv'
train_labels = pd.read_csv(train_labels_url)

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['review'], train_labels['sentiment'], test_size=0.2, random_state=42)

# Tokenize and remove punctuation
tokenizer = CountVectorizer().build_tokenizer()
X_train_tokenized = X_train.apply(lambda x: ' '.join(tokenizer(x)))
X_val_tokenized = X_val.apply(lambda x: ' '.join(tokenizer(x)))

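# Encode each review as binary bag-of-words features (1 if a term occurs in the review, else 0)
encoder = CountVectorizer(binary=True)
X_train_encoded = encoder.fit_transform(X_train_tokenized)
X_val_encoded = encoder.transform(X_val_tokenized)

# Train and evaluate Logistic Regression model
# (the default max_iter=100 triggers the convergence warning in the output; a larger max_iter may avoid it)
lr_model = LogisticRegression()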
lr_model.fit(X_train_encoded, y_train)
lr_pred = lr_model.predict(X_val_encoded)
accuracy = accuracy_score(y_val, lr_pred)

print("Accuracy:", accuracy)


Accuracy: 0.890125


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [38]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tokenize and remove punctuation
tokenizer = CountVectorizer().build_tokenizer()
X_train_tokenized = X_train.apply(lambda x: ' '.join(tokenizer(x)))
X_val_tokenized = X_val.apply(lambda x: ' '.join(tokenizer(x)))
X_prod_train_tokenized = X_prod_train.apply(lambda x: ' '.join(tokenizer(x)))
X_prod_val_tokenized = X_prod_val.apply(lambda x: ' '.join(tokenizer(x)))

# Encode each review as binary bag-of-words features (1 if a term occurs in the review, else 0)
encoder = CountVectorizer(binary=True)
X_train_encoded = encoder.fit_transform(X_train_tokenized)
X_val_encoded = encoder.transform(X_val_tokenized)
X_prod_train_encoded = encoder.transform(X_prod_train_tokenized)
X_prod_val_encoded = encoder.transform(X_prod_val_tokenized)

# Train and evaluate Logistic Regression model
# (the default max_iter=100 triggers the convergence warning in the output; a larger max_iter may avoid it)
lr_model = LogisticRegression()
lr_model.fit(X_train_encoded, y_train)
lr_pred_val = lr_model.predict(X_val_encoded)
lr_pred_prod = lr_model.predict(X_prod_val_encoded)

accuracy_val = accuracy_score(y_val, lr_pred_val)
accuracy_prod = accuracy_score(y_prod_val, lr_pred_prod)

print("Validation Accuracy:", accuracy_val)
print("Production Accuracy:", accuracy_prod)


Validation Accuracy: 0.890125
Production Accuracy: 0.8915
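Accuracy is the share of predictions that match the true labels. Because it is a single number, it hides which kind of error the model makes, so a count of (true, predicted) pairs is a useful companion. The sketch below computes both with the standard library. The label lists are made-up toy values, not results from the model above, and both helper names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(accuracy(y_true, y_pred), 4))
print(confusion_counts(y_true, y_pred))
```

With balanced classes, as in this dataset, accuracy is a fair headline metric, and the pair counts show whether the errors lean toward false positives or false negatives.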


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [39]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Create the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # Preprocessing step: CountVectorizer for tokenization
    ('model', LogisticRegression())     # Model: Logistic Regression
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the pipeline on the validation data
accuracy = pipeline.score(X_val, y_val)
print("Validation Accuracy:", accuracy)

# Score the trained pipeline on the held-out 20% slice of the production data (X_prod_val)
production_accuracy = pipeline.score(X_prod_val, y_prod_val)
print("Production Accuracy:", production_accuracy)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation Accuracy: 0.891875
Production Accuracy: 0.8815


In [41]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the training data
y_train_pred = pipeline.predict(X_train)

# Calculate the accuracy on the training data
train_accuracy = accuracy_score(y_train, y_train_pred)

# Print the accuracy
print("Training Accuracy:", train_accuracy)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training Accuracy: 0.95621875


In [42]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# Define the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

# Define the parameter grid
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],  # n-gram range for CountVectorizer
    'classifier__C': [0.1, 1.0, 10.0],  # regularization parameter for logistic regression
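    # Note: the default lbfgs solver supports only the 'l2' penalty, so the 'l1' candidates cannot be fit
    'classifier__penalty': ['l1', 'l2']  # penalty type for logistic regression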
}

# Perform grid search cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



In [43]:
# Print the best parameters and best score
print("Best Parameters:", best_params)
print("Best Score:", best_score)

Best Parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'vectorizer__ngram_range': (1, 2)}
Best Score: 0.9015000000000001


In [44]:
# Evaluate the model on the validation set
validation_accuracy = grid_search.score(X_val, y_val)
print("Validation Accuracy:", validation_accuracy)


Validation Accuracy: 0.9085
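
The grid above pairs the `'l1'` penalty with the default solver, which cannot fit it. The sketch below shows one way to search over both penalties by choosing a solver that supports them. The `liblinear` solver, the toy corpus, the reduced grid, and `cv=3` are all assumptions made for illustration; this is not the configuration that produced the scores reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (6 reviews repeated twice), for illustration only
texts = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "terrible plot, hated it",
    "boring and awful film",
    "loved the wonderful cast",
    "awful, boring, terrible",
] * 2
labels = ["positive", "positive", "negative", "negative", "positive", "negative"] * 2

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    # liblinear handles both 'l1' and 'l2' penalties (assumed choice)
    ("classifier", LogisticRegression(solver="liblinear")),
])

param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0],
    "classifier__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

# Every candidate should now have a real score instead of NaN
mean_scores = search.cv_results_["mean_test_score"]
print("Candidates evaluated:", len(mean_scores))
print("Any failed fits:", bool(np.isnan(mean_scores).any()))
print("Best parameters:", search.best_params_)
```

Checking `cv_results_` for NaN scores is a quick way to confirm that no parameter combination was silently dropped from the search.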


In [64]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['review'], train_labels['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)

# Train the Naive Bayes model
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_vectorized, y_train)

# Make predictions on the validation set
y_val_pred = naive_bayes.predict(X_val_vectorized)

# Calculate the accuracy of the model on the validation set
accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", accuracy)


Validation Accuracy: 0.84475


In [65]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the training dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['review'], train_labels['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)

# Train the Decision Tree model
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_vectorized, y_train)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val_vectorized)

# Calculate the accuracy of the model on the validation set
accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", accuracy)


Validation Accuracy: 0.7215


### Conclusion

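Among the three models tested, logistic regression performed best. After grid search (C = 0.1, L2 penalty, unigrams and bigrams) it reached a validation accuracy of 0.9085, ahead of naive Bayes (0.84475) and the decision tree (0.7215). Naive Bayes and the decision tree were run with default settings, so the comparison favors the tuned model, but even the untuned logistic regression scored about 0.89 on the validation set. Logistic regression is therefore the model selected for this sentiment analysis task.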
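
The comparison described above can be sketched as a single loop that fits each classifier in the same pipeline and scores it on the same held-out split. The toy data and the dictionary names below are hypothetical and chosen for illustration only; `max_iter=1000` and `random_state=42` are assumed settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up train/validation split, for illustration only
toy_train_texts = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "loved the wonderful cast",
    "a great and moving film",
    "terrible plot, hated it",
    "boring and awful film",
    "awful, boring, terrible",
    "hated the boring story",
]
toy_train_labels = ["positive"] * 4 + ["negative"] * 4
toy_val_texts = ["loved this great film", "awful and boring plot"]
toy_val_labels = ["positive", "negative"]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Same vectorizer for every model so only the classifier differs
    pipe = Pipeline([("vectorizer", CountVectorizer()), ("classifier", estimator)])
    pipe.fit(toy_train_texts, toy_train_labels)
    scores[name] = pipe.score(toy_val_texts, toy_val_labels)

best_model = max(scores, key=scores.get)
print("Validation accuracy per model:", scores)
print("Best model:", best_model)
```

Keeping the vectorizer and the split fixed makes the accuracies directly comparable across models.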

### Challenges

1. Data Preprocessing: I encountered issues such as noisy data, spelling errors, special characters, and inconsistent text formats. I invested time in cleaning and preparing the data to ensure it was suitable for analysis.

2. Imbalanced Classes: The dataset had imbalanced class distributions, which affected the model's ability to accurately predict minority classes. I addressed this challenge by applying resampling techniques and using appropriate evaluation metrics.

3. Feature Engineering: Extracting meaningful features from the text data was a significant challenge. To transform the raw text into relevant features, I experimented with several strategies, including tokenization, word stemming, and bag-of-words encoding.

4. Model Selection: Choosing the most suitable machine learning algorithm for sentiment analysis was challenging. I evaluated multiple models, considering their strengths and weaknesses, to select the best one for my dataset and problem.

5. Overfitting or Underfitting: A model that is too complex fits noise in the training data, while one that is too simple fails to capture the underlying patterns. The untuned logistic regression pipeline showed signs of overfitting, with a training accuracy of about 0.956 against a validation accuracy of about 0.892.
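
The overfitting risk in item 5 can be checked by comparing training and validation scores across cross-validation folds. The sketch below uses `cross_validate` with `return_train_score=True`. The toy corpus, `cv=3`, and `max_iter=1000` are assumptions made for illustration, not values taken from this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (6 reviews repeated twice), for illustration only
texts = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "terrible plot, hated it",
    "boring and awful film",
    "loved the wonderful cast",
    "awful, boring, terrible",
] * 2
labels = ["positive", "positive", "negative", "negative", "positive", "negative"] * 2

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# return_train_score=True also records accuracy on each fold's training part
cv_results = cross_validate(pipe, texts, labels, cv=3, return_train_score=True)

train_mean = cv_results["train_score"].mean()
val_mean = cv_results["test_score"].mean()
gap = train_mean - val_mean

print("Mean training accuracy:", train_mean)
print("Mean validation accuracy:", val_mean)
print("Gap (train - validation):", gap)
```

A large positive gap suggests overfitting, while low scores on both sides suggest underfitting.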


### What can be done further

1. Explore advanced text processing techniques to enhance sentiment analysis.
2. Experiment with additional feature engineering methods for better sentiment classification.
3. Deploy the sentiment analysis model in a real-time production environment.

These steps could improve the accuracy and overall performance of the sentiment analysis system.