This lab is a data preprocessing project.

## Lab: Sentiment Analysis (NLP)


# Table of Contents 
<ol start="1">
<li> About the project</li>
<li> Loading and Cleaning with Pandas</li>
<li> Data cleaning </li>
<li> Feature engineering </li>
<li> Text Preprocessing </li>
<li> Build Model </li>
<li> Visualization </li>
<li> Conclusion </li>
</ol>

## Problem Statement
The surge in MOOC learner reviews demands efficient data preprocessing for subsequent model readiness. Manual analysis is impractical due to the data volume. This project focuses on crucial preprocessing steps—cleaning, feature engineering, normalization, and transformation—to optimize the data for sentiment analysis models. The goal is to enable MOOC platforms to derive actionable insights, improve course quality, and identify areas for enhancement through systematic and effective data preparation. 

## Data Source
Udemy Courses- Comments.csv- Kaggle: "This dataset contains detailed information on all available Udemy courses on Oct 10, 2022. This data was provided in the "Course_info.csv" file. Also, over 9 million comments were collected and provided in the "Comments.csv" file. The information of over 209k courses was collected by web scraping the Udemy website. Udemy holds 209,734 courses and 73,514 instructors teaching courses in 79 languages in 13 different categories." --Kaggle.
In this project we use only Comments.csv.

Import the required libraries.

In [111]:
# Importing necessary libraries
import numpy as np   # NumPy for numerical operations
import pandas as pd  # Pandas for data manipulation
from nltk.tokenize import word_tokenize  # NLTK for natural language processing - tokenization
from nltk.corpus import stopwords  # NLTK for stop words
from nltk.stem.lancaster import LancasterStemmer  # NLTK for stemming
from nltk.stem.wordnet import WordNetLemmatizer  # NLTK for lemmatization
from sklearn.feature_extraction.text import TfidfVectorizer  # Scikit-learn for TF-IDF vectorization
from sklearn.model_selection import train_test_split  # Scikit-learn for train-test split
from tqdm import tqdm  # tqdm for progress bars
from bs4 import BeautifulSoup  # BeautifulSoup for HTML parsing


# Importing emot library for detecting emojis and emoticons
# !pip install emot
import emot 

# Regular expression library for text processing
import re

# Language detection library
# !pip install langdetect
from langdetect import detect

# Translators library for language translation
# !pip install translators
import translators as ts

###  Reading in the data

In [112]:
df=pd.read_csv('Comments.csv')
# get only 1000000 rows (from over 9,000,000 records)
df=df.head(1000000)
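
Since only the first 1,000,000 rows are used, `read_csv` can also stop parsing early through its `nrows` parameter instead of loading all records first. The sketch below illustrates the idea on a tiny in-memory CSV; the rows are made up for illustration and only mimic the column names of Comments.csv.

```python
import io
import pandas as pd

# Tiny stand-in for Comments.csv (made-up rows, same column names as the real file)
csv_text = (
    "id,course_id,rate,date,display_name,comment\n"
    "1,10,5.0,2022-01-01T10:00:00-08:00,Ann,Great course\n"
    "2,10,1.0,2022-01-02T11:00:00-08:00,Bob,Too basic\n"
    "3,11,4.5,2022-01-03T12:00:00-08:00,Cy,Very helpful\n"
)

# nrows limits how many data rows are parsed, so the remaining rows are never loaded
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample_df.shape)  # (2, 6)
```

With the real file, the equivalent call would presumably be `pd.read_csv('Comments.csv', nrows=1000000)`.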

### Inspecting and Performing Basic Operations on the Data
Check column information, including data types and null counts.

In [113]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   id            1000000 non-null  int64  
 1   course_id     1000000 non-null  int64  
 2   rate          1000000 non-null  float64
 3   date          1000000 non-null  object 
 4   display_name  999519 non-null   object 
 5   comment       999856 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 45.8+ MB


In [114]:
df.describe()

Unnamed: 0,id,course_id,rate
count,1000000.0,1000000.0,1000000.0
mean,69118450.0,2584639.0,4.382235
std,39457120.0,1365302.0,1.047236
min,282.0,5664.0,0.5
25%,35026470.0,1435326.0,4.0
50%,72187220.0,2630370.0,5.0
75%,106525300.0,3785872.0,5.0
max,126709000.0,4913148.0,5.0


In [115]:
# Display the first 5 rows  from the DataFrame
df.head(5)

Unnamed: 0,id,course_id,rate,date,display_name,comment
0,88962892,3173036,1.0,2021-06-29T18:54:25-07:00,Rahul,I think a beginner needs more than you think.\...
1,125535470,4913148,5.0,2022-10-07T11:17:41-07:00,Marlo,Aviva is such a natural teacher and healer/hea...
2,68767147,3178386,3.5,2020-10-19T06:35:37-07:00,Yamila Andrea,Muy buena la introducción para entender la bas...
3,125029758,3175814,5.0,2022-09-30T21:13:49-07:00,Jacqueline,This course is the best on Udemy. This breakd...
4,76584052,3174896,4.5,2021-01-30T08:45:11-08:00,Anthony,I found this course very helpful. It was full ...


In [116]:
# Display a random sample of 5 rows from the DataFrame
df.sample(5)

Unnamed: 0,id,course_id,rate,date,display_name,comment
146704,106269742,3188554,5.0,2022-01-29T11:46:29-08:00,Troy,"Relatable, encouraging, well laid out, convers..."
328693,79361144,3265854,4.0,2021-03-05T13:41:59-08:00,Donna,"Great information, quickly. Helps if you alrea..."
43905,92815376,4237450,5.0,2021-08-13T09:05:40-07:00,Bicky,It was excellent course
901213,40767162,1437626,1.0,2019-11-24T23:12:15-08:00,Tanvi,The focus of this course is on formatting. The...
824064,47375716,2034308,4.5,2020-03-15T23:17:07-07:00,Julio César,"En general un buen curso, me sirvió para refor..."


In [117]:
# Checking and Printing the number of duplicate values
dup_count = df.duplicated().sum()

print(f"There are {dup_count} duplicate values in the dataset")

There are 0 duplicate values in the dataset


In [118]:
# Display the list of column names in the DataFrame
for col in df.columns.to_list():
    print(col)

id
course_id
rate
date
display_name
comment


In [119]:
df.dtypes # Display the data types of each column in the DataFrame

id                int64
course_id         int64
rate            float64
date             object
display_name     object
comment          object
dtype: object

In [120]:
df.isnull().sum() # Check the number of null values in each column of the DataFrame

id                0
course_id         0
rate              0
date              0
display_name    481
comment         144
dtype: int64

# Data Cleaning
### Manage features and remove unnecessary ones

In [121]:
# Convert the 'date' column to datetime format using pandas
df['date'] = pd.to_datetime(df['date'])

In [122]:
# to check if the values of "date" column were converted to datetime format
df.head(2)

Unnamed: 0,id,course_id,rate,date,display_name,comment
0,88962892,3173036,1.0,2021-06-29 18:54:25-07:00,Rahul,I think a beginner needs more than you think.\...
1,125535470,4913148,5.0,2022-10-07 11:17:41-07:00,Marlo,Aviva is such a natural teacher and healer/hea...
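
The timestamps in this dataset carry two different UTC offsets (`-07:00` and `-08:00`). One possible way to obtain a single timezone-aware column on which the `.dt` accessor works is to pass `utc=True`. The following is a minimal sketch on two sample strings, not the conversion used in this lab; note that it shifts the displayed times to UTC.

```python
import pandas as pd

dates = pd.Series([
    "2021-06-29T18:54:25-07:00",  # offset during daylight saving time
    "2021-01-30T08:45:11-08:00",  # offset during standard time
])

# utc=True converts every timestamp to UTC, giving one timezone-aware column
dates_utc = pd.to_datetime(dates, utc=True)
print(dates_utc.dt.tz)             # UTC
print(dates_utc.dt.year.tolist())  # [2021, 2021]
```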


In [123]:
# Drop rows with null values in the 'comment' column
df=df.dropna(subset=['comment'])

In [124]:
df.isnull().sum() # Check the removal of null values in the comment column was successful

id                0
course_id         0
rate              0
date              0
display_name    480
comment           0
dtype: int64

In [125]:
# remove duplicate comments by id 
df=df.drop_duplicates(subset=['id'], keep='first')

# drop the id, display_name and course_id columns because they're not needed anymore
df=df.drop(['id','display_name','course_id'], axis=1)


In [126]:
df.count() # Check the removal of rows with duplicate comment fields and columns (id,display_name, course_id)  was successful

rate       999856
date       999856
comment    999856
dtype: int64

In [127]:
# Function to determine if a comment contains only numbers, blanks, or special characters
def contains_only_numbers_or_special_chars(comment):
    # Remove special characters and spaces (raw string avoids an invalid-escape warning)
    cleaned_text = re.sub(r'\W+', '', comment)
    # Remove digits
    cleaned_text = re.sub(r'\d+', '', cleaned_text)
    return cleaned_text.strip() == ''

In [128]:
# Apply filter to remove comments with only numbers, blanks, or special characters (unmeaningful comments)
df=df[~df['comment'].apply(contains_only_numbers_or_special_chars)]
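
To make the effect of this filter concrete, the sketch below runs the same two regular expressions on a few made-up comments. Two details are worth noticing: emoji-only comments count as "special characters" and are dropped here, before any emoji-to-word conversion, and underscores survive because `\w` includes `_`.

```python
import re

def contains_only_numbers_or_special_chars(comment):
    cleaned = re.sub(r'\W+', '', comment)  # drop non-word characters (punctuation, spaces, emoji)
    cleaned = re.sub(r'\d+', '', cleaned)  # drop digits
    return cleaned.strip() == ''

# Made-up example comments
examples = ["12345", "!!! ...", "👍👍", "Great course 10/10", "___"]
results = {text: contains_only_numbers_or_special_chars(text) for text in examples}
for text, flagged in results.items():
    print(repr(text), flagged)
# '12345' True
# '!!! ...' True
# '👍👍' True
# 'Great course 10/10' False
# '___' False
```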

In [129]:
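# Preview the DataFrame with a fresh 0-based index
# (the result is not assigned, so df itself keeps its original index labels)
df.reset_index(drop=True)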

Unnamed: 0,rate,date,comment
0,1.0,2021-06-29 18:54:25-07:00,I think a beginner needs more than you think.\...
1,5.0,2022-10-07 11:17:41-07:00,Aviva is such a natural teacher and healer/hea...
2,3.5,2020-10-19 06:35:37-07:00,Muy buena la introducción para entender la bas...
3,5.0,2022-09-30 21:13:49-07:00,This course is the best on Udemy. This breakd...
4,4.5,2021-01-30 08:45:11-08:00,I found this course very helpful. It was full ...
...,...,...,...
997809,5.0,2022-09-26 09:08:15-07:00,"Bem teorico, porem interessante, da uma boa id..."
997810,5.0,2022-10-02 14:00:08-07:00,Muito bom para organizar as ideias antes de co...
997811,4.5,2019-12-05 23:38:32-08:00,Le cours est très bien mais certains exercices...
997812,5.0,2019-12-11 21:19:14-08:00,"C'est un cours ordonné, et stimulant. L'instru..."


### Get a sample to work with

In [130]:
# Generate separate DataFrames based on rate conditions 
shape=800
df_0_1 = df[(df['rate'] >= 0) & (df['rate'] < 1)].head(shape)
df_1_2 = df[(df['rate'] >= 1) & (df['rate'] < 2)].head(shape)
df_2_3 = df[(df['rate'] >= 2) & (df['rate'] < 3)].head(shape)
df_3_4 = df[(df['rate'] >= 3) & (df['rate'] < 4)].head(shape)
df_4_5 = df[(df['rate'] >= 4) & (df['rate'] <= 5)].head(shape)

# Concatenate the DataFrames
df = pd.concat([df_0_1, df_1_2, df_2_3,df_3_4, df_4_5])
df

Unnamed: 0,rate,date,comment
597,0.5,2020-07-30 17:47:36-07:00,ничего интересного я на этом курсе не узнал....
997,0.5,2021-01-28 18:05:20-08:00,I need my certificate.
1417,0.5,2021-11-20 08:48:40-08:00,This is a very poor course dont this course. W...
3538,0.5,2021-05-21 04:28:46-07:00,Is just a quiz without explanations
3742,0.5,2017-04-30 00:08:39-07:00,the voice of this instructor is absolutely not...
...,...,...,...
1060,5.0,2022-10-17 14:56:19-07:00,Awesome course
1061,4.0,2020-12-11 05:38:38-08:00,"Kurs spełnił moje oczekiwania, zrozumiale prze..."
1062,5.0,2022-07-25 00:54:40-07:00,"Wonderful. Today, for the first time I got cla..."
1063,4.5,2020-11-23 22:37:33-08:00,this course vary help ful gand grate


# Feature engineering


- add detected_language column for the comment 
- add sentiment column based on rate 
- translate the non-English comments

In [131]:
# Function to detect the language of a comment
def detect_language(comment):
    try:
        return detect(str(comment))
    except Exception:
        return 'undetermined'  # Handle cases where language detection fails

df['detected_language'] = df['comment'].apply(detect_language)

df

Unnamed: 0,rate,date,comment,detected_language
597,0.5,2020-07-30 17:47:36-07:00,ничего интересного я на этом курсе не узнал....,ru
997,0.5,2021-01-28 18:05:20-08:00,I need my certificate.,en
1417,0.5,2021-11-20 08:48:40-08:00,This is a very poor course dont this course. W...,en
3538,0.5,2021-05-21 04:28:46-07:00,Is just a quiz without explanations,en
3742,0.5,2017-04-30 00:08:39-07:00,the voice of this instructor is absolutely not...,en
...,...,...,...,...
1060,5.0,2022-10-17 14:56:19-07:00,Awesome course,en
1061,4.0,2020-12-11 05:38:38-08:00,"Kurs spełnił moje oczekiwania, zrozumiale prze...",pl
1062,5.0,2022-07-25 00:54:40-07:00,"Wonderful. Today, for the first time I got cla...",en
1063,4.5,2020-11-23 22:37:33-08:00,this course vary help ful gand grate,en


In [132]:
# How many comments have an unrecognized language
# (len counts rows; .size would count rows times columns)
len(df[df["detected_language"]=="undetermined"])

0

### Translate the Comments

In [133]:
# Do not translate English comments; copy the slice so a new column can be added safely
non_english_comments = df[df['detected_language'] != 'en'].copy()

In [134]:
for index, row in tqdm(non_english_comments.iterrows(), total=len(non_english_comments), desc="Translating Comments"):
    try:
        non_english_comments.at[index, 'cleaned_comment'] = ts.translate_text(row['comment'])
    except Exception:
        # Write the marker into the DataFrame; assigning to `row` would only change a temporary copy
        non_english_comments.at[index, 'cleaned_comment'] = "Error in translation"

Translating Comments: 100%|████████████████████████████████████████████████████████| 1509/1509 [13:47<00:00,  1.82it/s]


In [135]:
non_english_comments

Unnamed: 0,rate,date,comment,detected_language,cleaned_comment
597,0.5,2020-07-30 17:47:36-07:00,ничего интересного я на этом курсе не узнал....,ru,I didn't learn anything interesting in this co...
16231,0.5,2017-08-29 11:38:38-07:00,Comprei um curso sobre um assunto e metade del...,pt,I bought a course on a subject and half of it ...
24220,0.5,2017-12-12 09:00:57-08:00,"Habla extraño, aunque con eso no me meto, pero...",es,"He talks strangely, although I don't mess with..."
34698,0.5,2016-05-10 12:39:18-07:00,dumb,fr,dumb
40702,0.5,2018-03-19 12:59:01-07:00,More explanation,fr,More explanation
...,...,...,...,...,...
1048,5.0,2022-02-10 07:26:04-08:00,"Très bonne formation, complète et simple à sui...",fr,"Very good training, complete and simple to fol..."
1050,5.0,2022-09-14 23:46:37-07:00,"Olá, Sue. Achei o curso bom. Atualizei a sua n...",pt,"Hello, Sue. I thought the course was good. I'v..."
1052,5.0,2021-08-27 09:00:40-07:00,Es sind sehr gute Anregungenn dabei. Danke.,de,There are very good suggestions. Thank you.
1061,4.0,2020-12-11 05:38:38-08:00,"Kurs spełnił moje oczekiwania, zrozumiale prze...",pl,"The course met my expectations, comprehensible..."


In [136]:
# how many errors at translation
print(f"Errors number : {len(non_english_comments[non_english_comments['cleaned_comment'] == 'Error in translation'])}")

Errors number : 0


In [137]:
# Create a new column cleaned_comment, initialized with the original comment
df['cleaned_comment']=df['comment']
# Merge the translated non_english_comments back into the original df (aligned on index)
df.update(non_english_comments)
# Remove the comments with an error in translation
df = df[df['cleaned_comment'] != "Error in translation"]
df

Unnamed: 0,rate,date,comment,detected_language,cleaned_comment
597,0.5,2020-07-30 17:47:36-07:00,ничего интересного я на этом курсе не узнал....,ru,I didn't learn anything interesting in this co...
997,0.5,2021-01-28 18:05:20-08:00,I need my certificate.,en,I need my certificate.
1417,0.5,2021-11-20 08:48:40-08:00,This is a very poor course dont this course. W...,en,This is a very poor course dont this course. W...
3538,0.5,2021-05-21 04:28:46-07:00,Is just a quiz without explanations,en,Is just a quiz without explanations
3742,0.5,2017-04-30 00:08:39-07:00,the voice of this instructor is absolutely not...,en,the voice of this instructor is absolutely not...
...,...,...,...,...,...
1060,5.0,2022-10-17 14:56:19-07:00,Awesome course,en,Awesome course
1061,4.0,2020-12-11 05:38:38-08:00,"Kurs spełnił moje oczekiwania, zrozumiale prze...",pl,"The course met my expectations, comprehensible..."
1062,5.0,2022-07-25 00:54:40-07:00,"Wonderful. Today, for the first time I got cla...",en,"Wonderful. Today, for the first time I got cla..."
1063,4.5,2020-11-23 22:37:33-08:00,this course vary help ful gand grate,en,this course vary help ful gand grate


In [138]:
# set for each comment a sentiment based on rate
for index, row in df.iterrows():
    rate = row['rate']
    if rate < 2:
        df.at[index, 'sentiment'] = 'negative'
    elif 2 <= rate <=3:
        df.at[index, 'sentiment'] = 'neutral'
    else:
        df.at[index, 'sentiment'] = 'positive'
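
The row-by-row loop above can also be written as a vectorized operation. The following is a sketch using `np.select` on a toy DataFrame with made-up ratings; it mirrors the same thresholds (below 2 is negative, 2 to 3 inclusive is neutral, everything else is positive).

```python
import numpy as np
import pandas as pd

# Made-up ratings standing in for df['rate']
toy = pd.DataFrame({'rate': [0.5, 1.5, 2.0, 3.0, 3.5, 5.0]})

# Conditions are checked in order, mirroring the if / elif / else chain
conditions = [toy['rate'] < 2, toy['rate'] <= 3]
choices = ['negative', 'neutral']
toy['sentiment'] = np.select(conditions, choices, default='positive')

print(toy['sentiment'].tolist())
# ['negative', 'negative', 'neutral', 'neutral', 'positive', 'positive']
```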

# Text Preprocessing 
-  convert emojis and emoticons into words
-  convert comments to lower case
-  remove links using regular expressions
-  remove HTML tags
-  remove numbers and special characters

In [139]:
# Convert emojis and emoticons into words
emot_obj = emot.core.emot() 
def replace_emojis_with_meanings(text):
    # Detect each category once and reuse the result
    emoji_info = emot_obj.emoji(text)
    emoticon_info = emot_obj.emoticons(text)

    # Remove special characters from both meanings lists
    emojis_meanings = [re.sub(r'[^\w\s]', '', meaning) for meaning in emoji_info['mean']]
    emoticons_meanings = [re.sub(r'[^\w\s]', '', meaning) for meaning in emoticon_info['mean']]

    # Replace emojis in the text with the corresponding meanings
    for emoji, meaning in zip(emoji_info['value'], emojis_meanings):
        text = text.replace(emoji, meaning)

    # Replace emoticons in the text with the corresponding meanings
    for emoticon, meaning in zip(emoticon_info['value'], emoticons_meanings):
        text = text.replace(emoticon, meaning)

    return text

In [140]:
# Convert comments to lower case
def convert_to_lower(comment):
    return comment.lower()

In [141]:
# Function to remove links using regular expressions
def remove_links(comment):
    # Regular expression pattern to match URLs
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(pattern, '', comment)

In [142]:
# Function to remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # Get the text without HTML tags
    text_without_tags = soup.get_text(separator=" ", strip=True)
    return text_without_tags

In [143]:
# Function to remove numbers and special characters
def remove_numbers_or_special_chars(comment):
    # Replace backslash escape sequences (a backslash plus one non-space character) with a space
    cleaned_text = re.sub(r'\\[^\s]', ' ', comment)
    # Expand "n't" into " not", e.g. "didn't" => "did not"
    cleaned_text = cleaned_text.replace("n't", " not")
    cleaned_text = cleaned_text.replace("_", " ")
    # Replace runs of non-word characters with a single space
    cleaned_text = re.sub(r'\W+', ' ', cleaned_text)
    # Replace digits with a space
    cleaned_text = re.sub(r'\d+', ' ', cleaned_text)
    return cleaned_text
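
One caveat of the generic `"n't"` replacement is that irregular contractions come out malformed: `"can't"` becomes `"ca not"` and `"won't"` becomes `"wo not"`. The sketch below shows one possible refinement. The helper `expand_negations` and its small mapping are hypothetical, and the sketch assumes lower-cased text with straight apostrophes.

```python
# Hypothetical mapping; it covers only a few irregular forms
IRREGULAR = {"can't": "can not", "won't": "will not", "shan't": "shall not"}

def expand_negations(text):
    # Handle irregular contractions first, then fall back to the generic "n't" rule
    for short, full in IRREGULAR.items():
        text = text.replace(short, full)
    return text.replace("n't", " not")

print("can't".replace("n't", " not"))  # ca not
print(expand_negations("i can't say i didn't like it, but i won't retake it"))
# i can not say i did not like it, but i will not retake it
```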


In [144]:
# apply cleaning functions on comments
def clean_text(comment):
    text_res = replace_emojis_with_meanings(comment)
    text_res = convert_to_lower(text_res)
    text_res = remove_links(text_res)
    text_res = remove_html_tags(text_res)
    text_res = remove_numbers_or_special_chars(text_res)
    return text_res

In [145]:
df['cleaned_comment'] = df['cleaned_comment'].apply(clean_text)



## Tokenization and lemmatization

In [146]:
# Remove stop words and lemmatize
stop_words = set(stopwords.words('english'))

# Keep negations and intensifiers such as "not", "very", and "so" because they carry sentiment
words_to_keep = {'very','too','so','not','no','but'}
stop_words = {word for word in stop_words if word not in words_to_keep}

lemmatizer = WordNetLemmatizer()

def preprocess_text(comment):
    try:
        tokens = word_tokenize(comment.lower())  # Tokenization
        tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
        tokens = [token for token in tokens if token not in stop_words]  # Remove stop words
        tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatization
        return ' '.join(tokens)
    except Exception:
        # Report the problematic comment and return an empty string so the column stays text-only
        print(comment)
        return ''

In [147]:
df['clean_text'] = df['cleaned_comment'].apply(preprocess_text)
df

Unnamed: 0,rate,date,comment,detected_language,cleaned_comment,sentiment,clean_text
597,0.5,2020-07-30 17:47:36-07:00,ничего интересного я на этом курсе не узнал....,ru,i did not learn anything interesting in this c...,negative,not learn anything interesting course
997,0.5,2021-01-28 18:05:20-08:00,I need my certificate.,en,i need my certificate,negative,need certificate
1417,0.5,2021-11-20 08:48:40-08:00,This is a very poor course dont this course. W...,en,this is a very poor course dont this course wa...,negative,very poor course dont course waste time money
3538,0.5,2021-05-21 04:28:46-07:00,Is just a quiz without explanations,en,is just a quiz without explanations,negative,quiz without explanation
3742,0.5,2017-04-30 00:08:39-07:00,the voice of this instructor is absolutely not...,en,the voice of this instructor is absolutely not...,negative,voice instructor absolutely not clear very har...
...,...,...,...,...,...,...,...
1060,5.0,2022-10-17 14:56:19-07:00,Awesome course,en,awesome course,positive,awesome course
1061,4.0,2020-12-11 05:38:38-08:00,"Kurs spełnił moje oczekiwania, zrozumiale prze...",pl,the course met my expectations comprehensible ...,positive,course met expectation comprehensible knowledg...
1062,5.0,2022-07-25 00:54:40-07:00,"Wonderful. Today, for the first time I got cla...",en,wonderful today for the first time i got clari...,positive,wonderful today first time got clarity formula...
1063,4.5,2020-11-23 22:37:33-08:00,this course vary help ful gand grate,en,this course vary help ful gand grate,positive,course vary help ful gand grate


# Modeling

### Use TF-IDF for text encoding

In [168]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['clean_text'], df['sentiment'], test_size=0.2, random_state=42)

# Define a pipeline with TF-IDF vectorizer and Logistic Regression classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000,ngram_range=(1, 2))),
    ('classifier', LogisticRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training Accuracy: 0.8998
Testing Accuracy: 0.8967
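
The iteration-limit message printed above is scikit-learn's hint to raise `max_iter`, and a single accuracy number can hide per-class differences. The sketch below shows, on a made-up toy dataset, how a higher `max_iter`, a stratified split, and `classification_report` could be combined. The texts, labels, and the name `toy_pipeline` are assumptions for illustration, and the printed numbers say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for df['clean_text'] and df['sentiment']
texts = ["great course", "very good teacher", "loved it", "excellent content",
         "okay course", "average content", "fine but slow", "not bad not good",
         "waste time", "very poor course", "bad audio", "not clear"]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

# stratify keeps the class proportions the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000)),  # higher cap than the default of 100
])
toy_pipeline.fit(X_tr, y_tr)
toy_predictions = toy_pipeline.predict(X_te)

# Per-class precision, recall, and F1 instead of a single accuracy number
print(classification_report(y_te, toy_predictions, zero_division=0))
```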


In [170]:
# Make prediction using our trained model
predicted_pos_sentiment = pipeline.predict(["i did like it"])
predicted_net_sentiment = pipeline.predict(["it is not bad but it is not good also"])
predicted_neg_sentiment = pipeline.predict(["did not love it"])

print(f"predicted pos sentiment: {predicted_pos_sentiment}")
print(f"predicted net sentiment: {predicted_net_sentiment}")
print(f"predicted neg sentiment: {predicted_neg_sentiment}")

predicted pos sentiment: ['positive']
predicted net sentiment: ['neutral']
predicted neg sentiment: ['negative']


### Another model with word-embedding text encoding

In [150]:
import spacy
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Load the small spaCy pipeline. Note that en_core_web_sm ships without static word vectors,
# so doc.vector here has only 96 dimensions; en_core_web_md would provide real word embeddings
nlp = spacy.load("en_core_web_sm")

texts = df['clean_text']

# Fixed vector length; shorter document vectors are zero-padded up to this size
max_length = 400

def doc_to_fixed_vector(text):
    # Run spaCy once per text, then zero-pad and truncate to max_length
    vector = nlp(text).vector
    return np.pad(vector, (0, max_length - len(vector)))[:max_length]

# Calculate the fixed-size vectors for each document
X = np.array([doc_to_fixed_vector(text) for text in tqdm(texts, desc="Creating vectors")])
y = df['sentiment']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a pipeline with Logistic Regression classifier
pipeline = Pipeline([
    ('classifier', LogisticRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")



Creating vectors: 100%|████████████████████████████████████████████████████████████| 4000/4000 [00:54<00:00, 73.17it/s]


Training Accuracy: 0.5603
Testing Accuracy: 0.5413


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [151]:
# New text for sentiment prediction
new_text = "it was a very good course"

# Vectorize the new text
new_text_vector = np.pad(nlp(new_text).vector, (0, max_length - len(nlp(new_text).vector)))[:max_length].reshape(1, -1)

# Make predictions using the trained model
predicted_sentiment = pipeline.predict(new_text_vector)

print(f"Predicted Sentiment: {predicted_sentiment[0]}")


Predicted Sentiment: negative


In [152]:
# Make prediction using your trained model
pos_sentiment = "i did liked it "
net_sentiment = "it is not bad but it is not good also"
neg_sentiment = "i did not love it "

# Vectorize the new text
pos_sentiment_vector = np.pad(nlp(pos_sentiment).vector, (0, max_length - len(nlp(pos_sentiment).vector)))[:max_length].reshape(1, -1)
net_sentiment_vector = np.pad(nlp(net_sentiment).vector, (0, max_length - len(nlp(net_sentiment).vector)))[:max_length].reshape(1, -1)
neg_sentiment_vector = np.pad(nlp(neg_sentiment).vector, (0, max_length - len(nlp(neg_sentiment).vector)))[:max_length].reshape(1, -1)

# Make predictions using the trained model
predicted_pos_sentiment = pipeline.predict(pos_sentiment_vector)
predicted_net_sentiment = pipeline.predict(net_sentiment_vector)
predicted_neg_sentiment = pipeline.predict(neg_sentiment_vector)

print(f"predicted pos sentiment: {predicted_pos_sentiment[0]}")
print(f"predicted net sentiment: {predicted_net_sentiment[0]}")
print(f"predicted neg sentiment: {predicted_neg_sentiment[0]}")

predicted pos sentiment: negative
predicted net sentiment: negative
predicted neg sentiment: negative


In [153]:

# Make prediction using your trained model
pos_sentiment = ["i did liked it "]
net_sentiment = ["it is not bad but it is not good also"]
neg_sentiment = ["i did not love it "]


# Get the spaCy document vectors for the new texts
# Note: these vectors are not padded to max_length, so they do not match
# the 400 features the model was trained on
predicted_pos_sentiment = [nlp(text).vector for text in pos_sentiment]
predicted_net_sentiment = [nlp(text).vector for text in net_sentiment]
predicted_neg_sentiment = [nlp(text).vector for text in neg_sentiment]

# Use the trained model to predict sentiment

print(f"predicted pos sentiment: {pipeline.predict(predicted_pos_sentiment)}")
print(f"predicted net sentiment: {pipeline.predict(predicted_net_sentiment)}")
print(f"predicted neg sentiment: {pipeline.predict(predicted_neg_sentiment)}")

ValueError: X has 96 features, but LogisticRegression is expecting 400 features as input.
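
The `ValueError` above arises because the classifier was trained on vectors padded to 400 entries, while this cell passes the raw 96-entry vectors. Routing both training and prediction through one helper keeps the sizes consistent. The sketch below uses a dummy NumPy vector in place of `nlp(text).vector`; the helper name `to_fixed_size` is hypothetical.

```python
import numpy as np

MAX_LENGTH = 400  # same value as max_length used for training

def to_fixed_size(vector, size=MAX_LENGTH):
    # Zero-pad on the right, then truncate, so every vector has exactly `size` entries
    padded = np.pad(vector, (0, max(0, size - len(vector))))
    return padded[:size]

# Stand-in for nlp(text).vector, which has 96 entries according to the error message
fake_doc_vector = np.ones(96, dtype=np.float32)

features = to_fixed_size(fake_doc_vector).reshape(1, -1)
print(features.shape)  # (1, 400)
```

Zero-padding only fixes the shape mismatch; it adds no information, so the 304 extra columns are constant for every document.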

# Use an external model for testing

### RoBERTa model

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
def polarity_scores_roberta(text):
    # Tokenize the text and return PyTorch tensors
    encoded_text = tokenizer(text, return_tensors='pt')
    output = model(**encoded_text)
    # Convert the logits of the single input into class probabilities
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
    return scores_dict
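
The `softmax` step turns the three raw model outputs (logits) into probabilities that sum to 1. A minimal NumPy sketch of that computation follows; the logits are made up and the helper `softmax_1d` is a stand-in for `scipy.special.softmax`, written out only to show the arithmetic.

```python
import numpy as np

def softmax_1d(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up logits in the order used above: negative, neutral, positive
logits = np.array([-1.0, 0.5, 2.0])
scores = softmax_1d(logits)
scores_dict = dict(zip(['roberta_neg', 'roberta_neu', 'roberta_pos'], scores))
print(scores_dict)
```

The largest logit always receives the largest probability, so here `roberta_pos` dominates.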

In [None]:
# Test for Roberta Model
polarity_scores_roberta(df.iloc[40]['clean_text'])

In [None]:
res = {}
for index, row in tqdm(df.iterrows(), total=len(df), desc="Scoring Comments"):
    text = row['clean_text']
    try:
        res[index] = polarity_scores_roberta(text)
    except Exception:
        # Keep an empty entry so the row still appears (with NaN scores) after the merge
        res[index] = {}
        print(f"error in index {index}")

In [None]:
# One row per comment, one column per RoBERTa score
res = pd.DataFrame(res).T
# Merge DataFrames based on index
merged_df = pd.merge(df, res, left_index=True, right_index=True)
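
Once the scores sit next to the rating-based labels, one possible comparison is to take the highest-scoring RoBERTa class per row and measure how often it agrees with the `sentiment` column. The sketch below does this on a made-up frame with the same column names as `merged_df`; the column `roberta_label` is a hypothetical addition.

```python
import pandas as pd

# Made-up rows with the same column names as merged_df
toy = pd.DataFrame({
    'sentiment':   ['negative', 'neutral', 'positive', 'positive'],
    'roberta_neg': [0.80, 0.20, 0.05, 0.60],
    'roberta_neu': [0.15, 0.50, 0.15, 0.30],
    'roberta_pos': [0.05, 0.30, 0.80, 0.10],
})

score_cols = ['roberta_neg', 'roberta_neu', 'roberta_pos']
label_map = {'roberta_neg': 'negative', 'roberta_neu': 'neutral', 'roberta_pos': 'positive'}

# Pick the highest-scoring class per row and translate the column name into a label
toy['roberta_label'] = toy[score_cols].idxmax(axis=1).map(label_map)

# Share of rows where RoBERTa and the rating-based label agree
agreement = (toy['roberta_label'] == toy['sentiment']).mean()
print(toy['roberta_label'].tolist())  # ['negative', 'neutral', 'positive', 'negative']
print(agreement)                      # 0.75
```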

# Visualization

### word cloud

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 200 , width = 1600 , height = 800,
               collocations=False).generate(" ".join(df[df['sentiment']=="positive"]['cleaned_comment']))
plt.imshow(wc)
plt.title("Most used words in positive comments")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 200 , width = 1600 , height = 800,
               collocations=False).generate(" ".join(df[df['sentiment']=="negative"]['cleaned_comment']))
plt.imshow(wc)
plt.title("Most used words in negative comments")

### Distribution of Ratings

In [None]:
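import pandas as pd
import matplotlib.pyplot as plt

# Count how many comments received each rating
rate_counts = df['rate'].value_counts().sort_index()

# Create a pie plot of ratings; wedge sizes are the counts, not the rating values
plt.figure(figsize=(8, 8))
plt.pie(rate_counts.values, labels=rate_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Ratings')
plt.show()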


### Number of Comments in Each Year

In [None]:
# Extract the year from each value in the 'date' column
years = pd.Series([dt.year for dt in df['date']])

# Count the number of rows for each year
yearly_counts = years.value_counts().sort_index()

# Create a bar plot
plt.bar(yearly_counts.index, yearly_counts.values)

# Set plot labels and title
plt.xlabel('Year')
plt.ylabel('Number of Comments')
plt.title('Number of Comments in Each Year')

# Display the plot
plt.show()

## Other Visualisations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

ax = merged_df['rate'].value_counts().sort_index().plot(kind="bar",title='Count of Reviews by Stars',figsize=(10, 5))
ax.set_xlabel('Review stars') 



In [None]:
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(data=merged_df, x='rate', y='roberta_pos', ax=axs[0])
sns.barplot(data=merged_df, x='rate', y='roberta_neu', ax=axs[1])
sns.barplot(data=merged_df, x='rate', y='roberta_neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')

# Conclusion
In conclusion, this project built a preprocessing pipeline for MOOC learner reviews and used it to prepare a sample of Udemy comments for sentiment analysis. The pipeline covered cleaning, language detection, translation of non-English comments, rating-based sentiment labels, text normalization, stop-word removal, and lemmatization. On the 4,000-comment sample, TF-IDF features with logistic regression reached a test accuracy of 0.8967, while spaCy document vectors with the same classifier reached only 0.5413 and labeled all three probe sentences as negative. A pretrained RoBERTa model was added as an external reference for comparison with the rating-based labels. Together, these steps show how systematic preprocessing can turn a large volume of unstructured learner feedback into input that sentiment models can use, helping MOOC platforms evaluate and improve their courses.