# **Interactive Sentiment and Business Intelligence Analysis using Machine Learning on Social Media Platforms**

![image.png](attachment:image.png)

### Dataset - CNN

### Group Members:
Devansh Hukmani, Dhairyav Shah, Keerthi Bhuvan Koduri, Manoj Naidu Yandrapu, Naina 
Gupta, Pariyani Mohammed Yusuf, Pengxiang Zhang, Priya Murali, Sudhamshu Vidyananda, 
Sumit Kamboj, Swathi Nadimpilli


In [2]:
import pandas as pd
import requests
from io import StringIO

# Function to read different types of files
def read_file(file_path, file_type):
    # Check the file type and read accordingly
    if file_type.lower() == 'csv':
        df = pd.read_csv(file_path)  # Read CSV file
    elif file_type.lower() == 'xlsx':
        df = pd.read_excel(file_path)  # Read Excel file
    elif file_type.lower() == 'url':
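        response = requests.get(file_path, timeout=30)  # Make a GET request to the URL
        response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page as CSV
        data = response.content.decode('utf-8')  # Decode response content
        df = pd.read_csv(StringIO(data))  # Read CSV data from URL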
    elif file_type.lower() == 'json':
        df = pd.read_json(file_path)  # Read JSON file
    elif file_type.lower() == 'text':
        with open(file_path, 'r') as file:
            data = file.read()  # Read text file content
        df = pd.DataFrame({'text': [data]})  # Create DataFrame with text content
    else:
        raise ValueError("Invalid file type. Supported types are: CSV, XLSX, URL, JSON, Text")

    return df

# Input file path or URL and file type from the user
file_path = input("Enter the file path or URL: ")
file_type = input("Enter the file type (CSV, XLSX, URL, JSON, Text): ")

# Read the file using the read_file function
df = read_file(file_path, file_type)
# Example input file: cnn_5550296508.csv
# Print the first few rows of the DataFrame
print(df.head())

Enter the file path or URL: cnn_5550296508.csv
Enter the file type (CSV, XLSX, URL, JSON, Text): csv
                                id     page_id  \
0  ﻿"5550296508_10150712177946509"  5550296508   
1    ﻿"5550296508_258636547563092"  5550296508   
2  ﻿"5550296508_10150712540566509"  5550296508   
3    ﻿"5550296508_350156181698587"  5550296508   
4    ﻿"5550296508_140431756086124"  5550296508   

                                                name  \
0                                                NaN   
1       How nations risk nuclear terrorism - CNN.com   
2     Facebook wants court to dismiss Ceglia lawsuit   
3  Report: Zimmerman told police teen punched him...   
4  Supreme Court divided over health care mandate...   

                                             message  \
0  Breaking News: French prosecutors: Former IMF ...   
1  CNN Opinion Contributor Richard Chasdi states ...   
2  'Ceglia has forged documents, destroyed eviden...   
3  An Orlando Sentinel report fills i
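
The `id` values printed above appear to start with an invisible character (likely a byte-order mark, U+FEFF) followed by literal double quotes, which suggests the quotes were not consumed during CSV parsing. A minimal sketch of one way to normalize such values follows. `clean_id` is a hypothetical helper that is not part of the notebook, and the assumption that the invisible character is U+FEFF is untested against the original file.

```python
BOM = "\ufeff"  # byte-order mark; assumed to be the invisible leading character

def clean_id(raw_id: str) -> str:
    """Remove byte-order marks, surrounding whitespace, and surrounding double quotes."""
    return raw_id.replace(BOM, "").strip().strip('"')

# Sample value modeled on the first row of the output above
sample = BOM + '"5550296508_10150712177946509"'
print(clean_id(sample))
```

On the DataFrame this could be applied with `df['id'] = df['id'].map(clean_id)`.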

In [3]:
# Display information about the DataFrame
print("DataFrame Information:")
df.info()

# Display summary statistics of the DataFrame
print("\nDataFrame Descriptive Statistics:")
df.describe()

DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31696 entries, 0 to 31695
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              31696 non-null  object
 1   page_id         31696 non-null  int64 
 2   name            29816 non-null  object
 3   message         31091 non-null  object
 4   description     18797 non-null  object
 5   caption         22184 non-null  object
 6   post_type       31696 non-null  object
 7   status_type     31664 non-null  object
 8   likes_count     31696 non-null  int64 
 9   comments_count  31696 non-null  int64 
 10  shares_count    31696 non-null  int64 
 11  love_count      31696 non-null  int64 
 12  wow_count       31696 non-null  int64 
 13  haha_count      31696 non-null  int64 
 14  sad_count       31696 non-null  int64 
 15  thankful_count  31696 non-null  int64 
 16  angry_count     31696 non-null  int64 
 17  link            31248 non-n

| | page_id | likes_count | comments_count | shares_count | love_count | wow_count | haha_count | sad_count | thankful_count | angry_count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 | 31696.0 |
| mean | 5550297000.0 | 4218.814 | 742.776218 | 1460.405 | 134.203149 | 71.098088 | 85.498265 | 114.827139 | 0.067674 | 84.207566 |
| std | 0.0 | 13186.57 | 1964.627591 | 14406.0 | 1239.543133 | 564.055115 | 576.071482 | 1554.700428 | 3.691511 | 870.131719 |
| min | 5550297000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5550297000.0 | 991.0 | 148.0 | 138.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 5550297000.0 | 1960.0 | 334.0 | 353.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 5550297000.0 | 4028.25 | 809.0 | 960.0 | 19.0 | 21.0 | 8.0 | 5.0 | 0.0 | 5.0 |
| max | 5550297000.0 | 1155249.0 | 237266.0 | 1934157.0 | 156066.0 | 45708.0 | 34904.0 | 149098.0 | 509.0 | 117430.0 |


In [4]:
# Check for missing values in the DataFrame and sum them up column-wise
df.isnull().sum()

id                    0
page_id               0
name               1880
message             605
description       12899
caption            9512
post_type             0
status_type          32
likes_count           0
comments_count        0
shares_count          0
love_count            0
wow_count             0
haha_count            0
sad_count             0
thankful_count        0
angry_count           0
link                448
picture             523
posted_at             0
dtype: int64

# **Data Preprocessing**

In [5]:
# Replace null values in specific columns with designated values.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about and may silently ignore.

# Replace null values in the "name" column with 'unknown'
df['name'] = df['name'].fillna('unknown')

# Replace null values in the "link" column with 'no link'
df['link'] = df['link'].fillna('no link')

# Replace null values in the "picture" column with 'no link'
df['picture'] = df['picture'].fillna('no link')

# Replace null values in the "caption" column with 'no caption'
df['caption'] = df['caption'].fillna('no caption')

# Replace null values in the "status_type" column with 'others'
df['status_type'] = df['status_type'].fillna('others')

# Drop rows with null values in the "message" column
df.dropna(subset=['message'], inplace=True)

# Drop rows with null values in the "description" column
df.dropna(subset=['description'], inplace=True)

In [6]:
# Check for missing values in the DataFrame and sum them up column-wise
df.isnull().sum()

id                0
page_id           0
name              0
message           0
description       0
caption           0
post_type         0
status_type       0
likes_count       0
comments_count    0
shares_count      0
love_count        0
wow_count         0
haha_count        0
sad_count         0
thankful_count    0
angry_count       0
link              0
picture           0
posted_at         0
dtype: int64

In [7]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# **NLP**

In [8]:
import nltk
from nltk.tokenize import TweetTokenizer, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import string
from textblob import TextBlob
import ipywidgets as widgets
from IPython.display import display

# Initialize TweetTokenizer
tokenizer = TweetTokenizer()

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def pos_tagging_and_lemmatization(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Lowercase conversion
    tokens = [word.lower() for word in tokens]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Remove punctuation and special symbols
    tokens = [word for word in tokens if word not in string.punctuation]

    # Perform POS tagging
    pos_tags = pos_tag(tokens)

    # Lemmatization with POS tagging
    lemmatized_tokens = []
    for word, tag in pos_tags:
        # Map the Penn Treebank tag prefix to a WordNet POS:
        # J -> adjective ('a'), V -> verb, N -> noun, R -> adverb; anything else defaults to noun
        pos = {'j': 'a', 'v': 'v', 'n': 'n', 'r': 'r'}.get(tag[0].lower(), 'n')
        lemma = lemmatizer.lemmatize(word, pos=pos)  # Perform lemmatization
        lemmatized_tokens.append(lemma)

    return ' '.join(lemmatized_tokens)

def preprocess_nlp_columns(df):
    text_columns = [col for col in df.columns if df[col].dtype == 'object']  # Get columns with string data type
    
    # Display message
    message = widgets.Label(value="Please select text column(s) to apply NLP:")
    display(message)
    
    checkbox_options = [widgets.Checkbox(description=col, value=False) for col in text_columns]
    checkbox_group = widgets.HBox(checkbox_options)
    display(checkbox_group)

    def process_columns(b):
        selected_columns = [checkbox_options[i].description for i, option in enumerate(checkbox_options) if option.value]
        if len(selected_columns) == 0:
            print("Please select at least one text column.")
            return
        for column in selected_columns:
            df['tokenized_' + column + '_lemmatized_pos'] = df[column].apply(pos_tagging_and_lemmatization)

    process_button = widgets.Button(description="Process")
    process_button.on_click(process_columns)
    display(process_button)

# Example usage:
preprocess_nlp_columns(df)


Label(value='Please select text column(s) to apply NLP:')

HBox(children=(Checkbox(value=False, description='id'), Checkbox(value=False, description='name'), Checkbox(va…

Button(description='Process', style=ButtonStyle())

This code snippet performs text preprocessing using NLTK and defines a function `pos_tagging_and_lemmatization` to tokenize, remove stopwords, remove punctuation, perform POS tagging, and lemmatize the text. Here's an explanation of each part:

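1. **NLTK Imports**:
   - `nltk`, a natural language processing library for Python, is imported.
   - Several NLTK components are imported:
     - `TweetTokenizer`: A tokenizer designed for tweets and other informal text.
     - `word_tokenize`: A function that splits text into word tokens.
     - `stopwords`: A corpus of stopwords in different languages.
     - `WordNetLemmatizer`: A lemmatizer based on WordNet.
     - `pos_tag`: A function that assigns a part-of-speech tag to each token.
   - `string`, `TextBlob`, `ipywidgets`, and `display` are also imported; the widgets drive the interactive column selection.

2. **Initializing Tokenizer and Lemmatizer**:
   - `TweetTokenizer` and `WordNetLemmatizer` objects are initialized. Only the lemmatizer is used in this cell; tokenization itself is done with `word_tokenize`.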

3. **POS Tagging and Lemmatization Function** (`pos_tagging_and_lemmatization`):
   - This function takes a text input and performs the following steps:
     - Tokenizes the text using `word_tokenize`.
     - Converts tokens to lowercase.
     - Removes stopwords using the English stopwords provided by NLTK.
     - Removes tokens that are punctuation characters, checked against the `string.punctuation` constant.
     - Performs POS tagging using `pos_tag`.
     - Lemmatizes each token based on its POS tag using `lemmatizer.lemmatize`.
     - Joins the lemmatized tokens back into a string and returns it.

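4. **Applying POS Tagging and Lemmatization** (`preprocess_nlp_columns`):
   - `preprocess_nlp_columns` lists the string-typed columns of the DataFrame as checkboxes. When the **Process** button is clicked, `pos_tagging_and_lemmatization` is applied to each selected column and the result is stored in a new column named `tokenized_<column>_lemmatized_pos`. In this run the `message` and `description` columns were selected, producing `tokenized_message_lemmatized_pos` and `tokenized_description_lemmatized_pos`.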

Overall, this code prepares text data for further analysis or modeling by tokenizing, removing stopwords and punctuation, performing POS tagging, and lemmatizing the text. This preprocessing step is commonly used in natural language processing tasks to improve the quality of text data for downstream tasks like sentiment analysis, classification, or information retrieval.
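
One detail of the punctuation step is worth illustrating. The test `word not in string.punctuation` is a substring check against a single string, so multi-character tokens that a tokenizer can emit, such as `...` or `''`, pass through it. The sketch below shows a stricter filter that drops any token made up entirely of punctuation characters. `is_punctuation` and `strip_punctuation_tokens` are hypothetical names, and the token list is made up for illustration.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token: str) -> bool:
    """True when the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

def strip_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [tok for tok in tokens if not is_punctuation(tok)]

tokens = ["supreme", "court", "...", "''", ",", "wi-fi", "'s"]

# Rule used in the notebook: substring membership in string.punctuation
print([t for t in tokens if t not in string.punctuation])

# Stricter rule: drop tokens consisting only of punctuation characters
print(strip_punctuation_tokens(tokens))
```

The stricter rule keeps hyphenated words such as `wi-fi` and clitics such as `'s`, because they contain letters; whether those should also be removed depends on the downstream task.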

In [9]:
df.head()

Unnamed: 0,id,page_id,name,message,description,caption,post_type,status_type,likes_count,comments_count,...,wow_count,haha_count,sad_count,thankful_count,angry_count,link,picture,posted_at,tokenized_message_lemmatized_pos,tokenized_description_lemmatized_pos
1,"﻿""5550296508_258636547563092""",5550296508,How nations risk nuclear terrorism - CNN.com,CNN Opinion Contributor Richard Chasdi states ...,Richard Chasdi says that nations that empower ...,cnn.com,link,shared_story,542,118,...,0,0,0,0,0,http://www.cnn.com/2012/03/26/opinion/chasdi-n...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-26T21:30:27,cnn opinion contributor richard chasdi state e...,richard chasdi say nation empower proxy group ...
2,"﻿""5550296508_10150712540566509""",5550296508,Facebook wants court to dismiss Ceglia lawsuit,"'Ceglia has forged documents, destroyed eviden...",Paul Ceglia originally filed the attention-gra...,money.cnn.com,link,published_story,185,67,...,0,0,0,0,0,http://cnnmon.ie/HdBpwG,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T01:30:45,'ceglia forge document destroy evidence abuse ...,paul ceglia originally file attention-grabbing...
3,"﻿""5550296508_350156181698587""",5550296508,Report: Zimmerman told police teen punched him...,An Orlando Sentinel report fills in some blank...,"A month ago Monday, Trayvon Martin, an unarmed...",cnn.com,link,shared_story,488,3009,...,0,0,0,0,0,http://www.cnn.com/2012/03/26/justice/florida-...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T13:41:26,orlando sentinel report fill blank purportedly...,month ago monday trayvon martin unarm florida ...
4,"﻿""5550296508_140431756086124""",5550296508,Supreme Court divided over health care mandate...,At the core of the health care law is the indi...,The Supreme Court appeared divided Tuesday ove...,cnn.com,link,shared_story,538,782,...,0,0,0,0,0,http://www.cnn.com/2012/03/27/justice/scotus-h...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T17:52:15,core health care law individual mandate provis...,supreme court appear divide tuesday controvers...
7,"﻿""5550296508_344070822311532""",5550296508,Coolest inventions coming in 2012,A 'super Wi-Fi' network; Windows 8; the Lytro ...,These 5 creations look to be some of this year...,money.cnn.com,link,shared_story,676,80,...,0,0,0,0,0,http://money.cnn.com/galleries/2012/technology...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T22:00:35,'super wi-fi network windows 8 lytro camera go...,5 creation look year 's transformative tech pr...


### Combining message and description Column

In [10]:
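# Join the two processed columns with a space so the last word of the message
# does not fuse with the first word of the description
df['Combined_Column'] = df['tokenized_message_lemmatized_pos'].astype(str) + ' ' + df['tokenized_description_lemmatized_pos'].astype(str)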

In [11]:
df.head()

Unnamed: 0,id,page_id,name,message,description,caption,post_type,status_type,likes_count,comments_count,...,haha_count,sad_count,thankful_count,angry_count,link,picture,posted_at,tokenized_message_lemmatized_pos,tokenized_description_lemmatized_pos,Combined_Column
1,"﻿""5550296508_258636547563092""",5550296508,How nations risk nuclear terrorism - CNN.com,CNN Opinion Contributor Richard Chasdi states ...,Richard Chasdi says that nations that empower ...,cnn.com,link,shared_story,542,118,...,0,0,0,0,http://www.cnn.com/2012/03/26/opinion/chasdi-n...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-26T21:30:27,cnn opinion contributor richard chasdi state e...,richard chasdi say nation empower proxy group ...,cnn opinion contributor richard chasdi state e...
2,"﻿""5550296508_10150712540566509""",5550296508,Facebook wants court to dismiss Ceglia lawsuit,"'Ceglia has forged documents, destroyed eviden...",Paul Ceglia originally filed the attention-gra...,money.cnn.com,link,published_story,185,67,...,0,0,0,0,http://cnnmon.ie/HdBpwG,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T01:30:45,'ceglia forge document destroy evidence abuse ...,paul ceglia originally file attention-grabbing...,'ceglia forge document destroy evidence abuse ...
3,"﻿""5550296508_350156181698587""",5550296508,Report: Zimmerman told police teen punched him...,An Orlando Sentinel report fills in some blank...,"A month ago Monday, Trayvon Martin, an unarmed...",cnn.com,link,shared_story,488,3009,...,0,0,0,0,http://www.cnn.com/2012/03/26/justice/florida-...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T13:41:26,orlando sentinel report fill blank purportedly...,month ago monday trayvon martin unarm florida ...,orlando sentinel report fill blank purportedly...
4,"﻿""5550296508_140431756086124""",5550296508,Supreme Court divided over health care mandate...,At the core of the health care law is the indi...,The Supreme Court appeared divided Tuesday ove...,cnn.com,link,shared_story,538,782,...,0,0,0,0,http://www.cnn.com/2012/03/27/justice/scotus-h...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T17:52:15,core health care law individual mandate provis...,supreme court appear divide tuesday controvers...,core health care law individual mandate provis...
7,"﻿""5550296508_344070822311532""",5550296508,Coolest inventions coming in 2012,A 'super Wi-Fi' network; Windows 8; the Lytro ...,These 5 creations look to be some of this year...,money.cnn.com,link,shared_story,676,80,...,0,0,0,0,http://money.cnn.com/galleries/2012/technology...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T22:00:35,'super wi-fi network windows 8 lytro camera go...,5 creation look year 's transformative tech pr...,'super wi-fi network windows 8 lytro camera go...


# **Sentiment Analysis**

In [12]:
from textblob import TextBlob
import ipywidgets as widgets
from IPython.display import display

# Define a function to perform sentiment analysis on text
def get_sentiment(df, column_name):
    # Create a TextBlob object for the input text
    analysis = df[column_name].apply(lambda x: TextBlob(str(x)))

    # Check the polarity of the sentiment analysis
    sentiment_labels = []
    for blob in analysis:
        if blob.sentiment.polarity > 0:         # If polarity is positive
            sentiment_labels.append('Positive') # Assign 'Positive' sentiment label
        elif blob.sentiment.polarity == 0:      # If polarity is neutral
            sentiment_labels.append('Neutral')  # Assign 'Neutral' sentiment label
        else:                                   # If polarity is negative
            sentiment_labels.append('Negative') # Assign 'Negative' sentiment label
    return sentiment_labels

def sentiment_analysis(df):
    column_names = df.columns.tolist()  # Get all column names from the DataFrame
    dropdown_options = [col for col in column_names]
    dropdown = widgets.Dropdown(options=dropdown_options, description='Select column:')
    display(dropdown)

    def process_sentiment_analysis(b):
        selected_column = dropdown.value
        df['sentiment_label'] = get_sentiment(df, selected_column)
        print("Sentiment analysis applied to column:", selected_column)

    process_button = widgets.Button(description="Apply Sentiment Analysis")
    process_button.on_click(process_sentiment_analysis)
    display(process_button)

# Example usage:
sentiment_analysis(df)


Dropdown(description='Select column:', options=('id', 'page_id', 'name', 'message', 'description', 'caption', …

Button(description='Apply Sentiment Analysis', style=ButtonStyle())

Sentiment analysis applied to column: Combined_Column


1. **Import Necessary Libraries**:
   - The code imports `TextBlob`, a Python library for processing textual data, along with `ipywidgets` and `display` for the interactive controls.

2. **Define Sentiment Analysis Function** (`get_sentiment`):
   - `get_sentiment` takes the DataFrame and a column name as its arguments.
   - Each value in the column is converted to a string and wrapped in a `TextBlob` object, which exposes the sentiment scores.

3. **Perform Sentiment Analysis**:
   - The polarity of each text is read from the `sentiment.polarity` attribute of its `TextBlob` object.
   - If the polarity is greater than 0, the text is labeled `'Positive'`.
   - If the polarity is equal to 0, the text is labeled `'Neutral'`.
   - If the polarity is less than 0, the text is labeled `'Negative'`.
   - The function returns the list of labels, one per row.

4. **Apply Sentiment Analysis to a DataFrame Column** (`sentiment_analysis`):
   - `sentiment_analysis` displays a dropdown of all DataFrame columns and a button. When the button is clicked, `get_sentiment` is run on the selected column (`Combined_Column` in this run).
   - The labels are stored in a new column named `sentiment_label`, one label per row.

Overall, this code performs sentiment analysis on text data using TextBlob and assigns a sentiment label to each row of a user-selected DataFrame column.
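
The labeling rule can be isolated from TextBlob as a small function over the polarity score, which TextBlob reports in the range [-1.0, 1.0]. The sketch below is illustrative only: `label_from_polarity` and its `neutral_band` parameter are hypothetical, and the default of `0.0` reproduces the rule used in the notebook, where only a polarity of exactly zero is neutral.

```python
def label_from_polarity(polarity: float, neutral_band: float = 0.0) -> str:
    """Map a polarity score to a sentiment label.

    Scores above +neutral_band are Positive, scores below -neutral_band
    are Negative, and everything in between is Neutral.
    """
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

# Made-up polarity scores for illustration
for score in (0.35, 0.0, -0.2, 0.05):
    print(score, label_from_polarity(score), label_from_polarity(score, neutral_band=0.1))
```

A non-zero band would treat weakly polar text as neutral. Whether that helps depends on the data and would need to be checked against labeled examples.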

In [13]:
df.head()

Unnamed: 0,id,page_id,name,message,description,caption,post_type,status_type,likes_count,comments_count,...,sad_count,thankful_count,angry_count,link,picture,posted_at,tokenized_message_lemmatized_pos,tokenized_description_lemmatized_pos,Combined_Column,sentiment_label
1,"﻿""5550296508_258636547563092""",5550296508,How nations risk nuclear terrorism - CNN.com,CNN Opinion Contributor Richard Chasdi states ...,Richard Chasdi says that nations that empower ...,cnn.com,link,shared_story,542,118,...,0,0,0,http://www.cnn.com/2012/03/26/opinion/chasdi-n...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-26T21:30:27,cnn opinion contributor richard chasdi state e...,richard chasdi say nation empower proxy group ...,cnn opinion contributor richard chasdi state e...,Positive
2,"﻿""5550296508_10150712540566509""",5550296508,Facebook wants court to dismiss Ceglia lawsuit,"'Ceglia has forged documents, destroyed eviden...",Paul Ceglia originally filed the attention-gra...,money.cnn.com,link,published_story,185,67,...,0,0,0,http://cnnmon.ie/HdBpwG,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T01:30:45,'ceglia forge document destroy evidence abuse ...,paul ceglia originally file attention-grabbing...,'ceglia forge document destroy evidence abuse ...,Negative
3,"﻿""5550296508_350156181698587""",5550296508,Report: Zimmerman told police teen punched him...,An Orlando Sentinel report fills in some blank...,"A month ago Monday, Trayvon Martin, an unarmed...",cnn.com,link,shared_story,488,3009,...,0,0,0,http://www.cnn.com/2012/03/26/justice/florida-...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T13:41:26,orlando sentinel report fill blank purportedly...,month ago monday trayvon martin unarm florida ...,orlando sentinel report fill blank purportedly...,Positive
4,"﻿""5550296508_140431756086124""",5550296508,Supreme Court divided over health care mandate...,At the core of the health care law is the indi...,The Supreme Court appeared divided Tuesday ove...,cnn.com,link,shared_story,538,782,...,0,0,0,http://www.cnn.com/2012/03/27/justice/scotus-h...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T17:52:15,core health care law individual mandate provis...,supreme court appear divide tuesday controvers...,core health care law individual mandate provis...,Positive
7,"﻿""5550296508_344070822311532""",5550296508,Coolest inventions coming in 2012,A 'super Wi-Fi' network; Windows 8; the Lytro ...,These 5 creations look to be some of this year...,money.cnn.com,link,shared_story,676,80,...,0,0,0,http://money.cnn.com/galleries/2012/technology...,https://external.xx.fbcdn.net/safe_image.php?d...,2012-03-27T22:00:35,'super wi-fi network windows 8 lytro camera go...,5 creation look year 's transformative tech pr...,'super wi-fi network windows 8 lytro camera go...,Positive


# **Feature Selection**

In [14]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder, StandardScaler
import ipywidgets as widgets

def feature_extraction(df, text_column, numerical_features, target_column):
    # Text Data Processing
    tokenizer = Tokenizer(num_words=10000)
    tokenizer.fit_on_texts(df[text_column])
    X_text = tokenizer.texts_to_sequences(df[text_column])
    X_text = pad_sequences(X_text, maxlen=100)

    # Adding numerical features if selected
    if numerical_features:
        scaler = StandardScaler()
        X_numerical = df[numerical_features].values
        X_numerical_scaled = scaler.fit_transform(X_numerical)

        # Concatenate text and numerical features
        X = np.concatenate((X_text, X_numerical_scaled), axis=1)
    else:
        X = X_text

    # Encode the target variable
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(df[target_column])

    return X, y

# Get column names for text columns (non-numerical)
text_column_options = [col for col in df.columns if df[col].dtype == 'object']

# Create dropdown widgets for text_column and target_column
text_column_dropdown = widgets.Dropdown(description='Text Column:', options=text_column_options)
target_column_dropdown = widgets.Dropdown(description='Target Column:', options=df.columns)

# Create a button for triggering the feature extraction process
extract_features_button = widgets.Button(description='Extract Features')

# Automatically detect numerical features from the dataset
numerical_features = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]

# Create checkboxes for numerical features
numerical_features_checkboxes = [widgets.Checkbox(description=num_feature) for num_feature in numerical_features]

# Store selected numerical features
selected_numerical_features = []

# Define function to handle button click event
def on_button_clicked(b):
    global selected_numerical_features
    global text_column
    global target_column
    
    text_column = text_column_dropdown.value
    target_column = target_column_dropdown.value
    selected_numerical_features = [checkbox.description for checkbox in numerical_features_checkboxes if checkbox.value]
    X, y = feature_extraction(df, text_column, selected_numerical_features, target_column)
    print(f'Selected text column: {text_column}, Selected target column: {target_column}')
    print(f'Selected numerical features: {selected_numerical_features}')
    print(f'X shape: {X.shape}, y shape: {y.shape}')

# Attach the button click event handler to the button
extract_features_button.on_click(on_button_clicked)

# Create a VBox for numerical features checkboxes
numerical_features_widget = widgets.VBox([widgets.Label("Numerical Features:"), *numerical_features_checkboxes])

# Display dropdowns, checkboxes, and button
display(text_column_dropdown, target_column_dropdown, numerical_features_widget, extract_features_button)


Dropdown(description='Text Column:', options=('id', 'name', 'message', 'description', 'caption', 'post_type', …

Dropdown(description='Target Column:', options=('id', 'page_id', 'name', 'message', 'description', 'caption', …

VBox(children=(Label(value='Numerical Features:'), Checkbox(value=False, description='page_id'), Checkbox(valu…

Button(description='Extract Features', style=ButtonStyle())

Selected text column: Combined_Column, Selected target column: sentiment_label
Selected numerical features: ['likes_count', 'comments_count', 'shares_count']
X shape: (18260, 103), y shape: (18260,)


The `feature_extraction` function is responsible for processing the text data, adding numerical features, and encoding the target variable for machine learning tasks.

1. **Text Data Processing**:
   - The function uses the `Tokenizer` class from TensorFlow Keras to tokenize the text data in the specified `text_column`. It limits the vocabulary size to 10,000 words (`num_words=10000`).
   - The tokenized sequences are then padded to a maximum length of 100 using `pad_sequences`. This ensures that all sequences have the same length for model compatibility.

2. **Adding Numerical Features**:
   - If any numerical features are selected, the function scales them using `StandardScaler` from scikit-learn, which standardizes each feature by removing the mean and scaling to unit variance. If none are selected, only the text features are used.

3. **Concatenating Text and Numerical Features**:
   - The tokenized and padded text sequences (`X_text`) are concatenated with the scaled numerical features (`X_numerical_scaled`) along the columns (axis=1) using NumPy's `concatenate` function. This results in a feature matrix `X` containing both text and numerical features.

4. **Encoding the Target Variable**:
   - The target variable (`target_column`) is encoded using `LabelEncoder` from scikit-learn. This assigns a unique integer label to each unique category in the target variable `y`.

5. **Returning Features and Target**:
   - The function returns the feature matrix `X` and the encoded target variable `y`.

Overall, this function prepares the data for machine learning tasks by processing text data, adding numerical features, and encoding the target variable, making it suitable for training machine learning models.
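
To make the shapes concrete: with `maxlen=100` and the three numerical features selected in the run above, each row has 100 + 3 = 103 columns, which matches the printed `X shape: (18260, 103)`. The sketch below imitates, in plain Python, what `pad_sequences` does with its default settings (zeros added at the front, long sequences truncated from the front) and how `LabelEncoder` assigns integers in sorted class order. `pad_pre` and `encode_labels` are hypothetical stand-ins written for illustration, not the library implementations.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    sequence = list(sequence)
    if len(sequence) > maxlen:
        return sequence[len(sequence) - maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

def encode_labels(labels):
    """Assign integers to labels in sorted class order; return codes and classes."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

# Made-up token id sequences, padded to length 5 for readability
print(pad_pre([4, 7, 9], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))

codes, classes = encode_labels(["Positive", "Negative", "Neutral", "Positive"])
print(classes, codes)
```

With the three sentiment labels used here, sorted order gives `Negative` = 0, `Neutral` = 1, and `Positive` = 2.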

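The text column and the target column are chosen from dropdowns, and the numerical features are chosen with checkboxes. When the **Extract Features** button is clicked, the selections are passed as arguments to `feature_extraction`. In the run above, `Combined_Column` was the text column, `sentiment_label` the target column, and `likes_count`, `comments_count`, and `shares_count` the numerical features.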
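
`StandardScaler` computes z = (x - mean) / std for each feature, using the population standard deviation. The notebook fits the scaler on all rows before any train/test split; a common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below shows that variant with the standard library. `fit_standardizer` and `apply_standardizer` are hypothetical helpers, and the values are the `likes_count` entries from the first rows shown earlier, split arbitrarily for illustration.

```python
import statistics

def fit_standardizer(values):
    """Return (mean, std) of the values; a zero std is replaced by 1.0."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train_likes = [542, 185, 488, 538]  # rows treated as training data
test_likes = [676]                  # row treated as test data

mean, std = fit_standardizer(train_likes)
train_scaled = apply_standardizer(train_likes, mean, std)
test_scaled = apply_standardizer(test_likes, mean, std)

print(train_scaled)
print(test_scaled)
```

With scikit-learn the same idea corresponds to calling `fit_transform` on the training split and `transform` on the test split.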

# **Model Building**

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, Input, Concatenate, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def build_train_evaluate_model(df, text_column, numerical_features, target_column, model_type):
    results = {}
    if model_type.lower() == 'cnn':
        # CNN Model Building
        tokenizer = Tokenizer(num_words=10000)
        tokenizer.fit_on_texts(df[text_column])
        X_text = tokenizer.texts_to_sequences(df[text_column])
        X_text = pad_sequences(X_text, maxlen=100)

        scaler = StandardScaler()
        X_numerical = df[numerical_features].values
        X_numerical_scaled = scaler.fit_transform(X_numerical)

        X = np.concatenate((X_text, X_numerical_scaled), axis=1)

        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(df[target_column])

        input_text = Input(shape=(100,))
        embedding = Embedding(input_dim=10000, output_dim=50)(input_text)
        conv1d = Conv1D(filters=128, kernel_size=5, activation='relu')(embedding)
        global_max_pooling = GlobalMaxPooling1D()(conv1d)

        input_numerical = Input(shape=(len(numerical_features),))
        concatenated = Concatenate()([global_max_pooling, input_numerical])
        dense1 = Dense(64, activation='relu')(concatenated)
        dropout = Dropout(0.5)(dense1)
        output = Dense(3, activation='softmax')(dropout)

        model = Model(inputs=[input_text, input_numerical], outputs=output)
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        model.summary()

        X_train_text, X_test_text, X_train_numerical, X_test_numerical, y_train, y_test = train_test_split(X_text, X_numerical_scaled, y, test_size=0.2, random_state=42)
        model.fit([X_train_text, X_train_numerical], y_train, epochs=10, batch_size=64, validation_data=([X_test_text, X_test_numerical], y_test))
        y_pred = np.argmax(model.predict([X_test_text, X_test_numerical]), axis=-1)

        # Evaluate the model
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)

        # Store results
        results['model'] = model_type
        results['accuracy'] = accuracy
        results['report'] = report

    elif model_type.lower() == 'rnn':
        # RNN Model Building
        tokenizer = Tokenizer(num_words=10000)
        tokenizer.fit_on_texts(df[text_column])
        X_text = tokenizer.texts_to_sequences(df[text_column])
        X_text = pad_sequences(X_text, maxlen=100)

        # The LSTM branch is trained on the text input only; the numerical
        # features are not used here, so no scaling is needed.

        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(df[target_column])

        input_text = Input(shape=(100,))
        embedding = Embedding(input_dim=10000, output_dim=50)(input_text)
        lstm = LSTM(64)(embedding)
        dense1 = Dense(64, activation='relu')(lstm)
        dropout = Dropout(0.5)(dense1)
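        # One output unit per class found by the label encoder (3 for this dataset)
        output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)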

        model = Model(inputs=input_text, outputs=output)
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        model.summary()

        X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)
        model.fit(X_train_text, y_train, epochs=10, batch_size=64, validation_data=(X_test_text, y_test))
        y_pred = np.argmax(model.predict(X_test_text), axis=-1)

        # Evaluate the model
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)

        # Store results
        results['model'] = model_type
        results['accuracy'] = accuracy
        results['report'] = report

    else:
        # Machine Learning Models
        if model_type.lower() == 'svm':
            model = SVC()
        elif model_type.lower() == 'logistic':
            model = LogisticRegression()
        elif model_type.lower() == 'decisiontree':
            model = DecisionTreeClassifier()
        elif model_type.lower() == 'randomforest':
            model = RandomForestClassifier()
        elif model_type.lower() == 'xgboost':
            model = XGBClassifier()
        elif model_type.lower() == 'naivebayes':
            model = GaussianNB()
        elif model_type.lower() == 'knn':
            model = KNeighborsClassifier()
        else:
            raise ValueError("Invalid model type. Please choose from: CNN, RNN, SVM, Logistic, DecisionTree, RandomForest, XGBoost, NaiveBayes, or KNN.")

        # Prepare data: padded token-ID sequences concatenated with the scaled numerical features
        tokenizer = Tokenizer(num_words=10000)
        tokenizer.fit_on_texts(df[text_column])
        X_text = tokenizer.texts_to_sequences(df[text_column])
        X_text = pad_sequences(X_text, maxlen=100)

        scaler = StandardScaler()
        X_numerical = df[numerical_features].values
        X_numerical_scaled = scaler.fit_transform(X_numerical)

        X = np.concatenate((X_text, X_numerical_scaled), axis=1)

        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(df[target_column])

        # Train-test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Train the model
        model.fit(X_train, y_train)

        # Evaluate the model
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)

        # Store results
        results['model'] = model_type
        results['accuracy'] = accuracy
        results['report'] = report

    return results

# List of models to compare
model_types = ['cnn', 'rnn', 'svm', 'logistic', 'decisiontree', 'randomforest', 'xgboost', 'naivebayes', 'knn']

# Dictionary to store results
all_results = {}

# Build, train, and evaluate each model
for model_type in model_types:
    results = build_train_evaluate_model(df, text_column, numerical_features, target_column, model_type)
    all_results[model_type] = results


# Find the best model based on the highest accuracy
best_accuracy = float('-inf')
best_model = None
for model_type, result in all_results.items():
    if 'accuracy' in result:
        if result['accuracy'] > best_accuracy:
            best_accuracy = result['accuracy']
            best_model = result['model']

# Print the best model and its accuracy
if best_model is not None:
    print("Best Model:", best_model)
    print("Accuracy:", best_accuracy)
    print("Classification Report:")
    print(all_results[best_model]['report'])
else:
    print("No model produced an accuracy score.")

# Print results of all models
for model_type, result in all_results.items():
    print("Model:", model_type)
    if 'accuracy' in result:
        print("Accuracy:", result['accuracy'])
    else:
        print("Accuracy: N/A")
    if 'report' in result:
        print("Classification Report:")
        print(result['report'])
    else:
        print("Classification Report: N/A")
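
The best-model search above can also be written more compactly with `max`. The following is only a minimal sketch under stated assumptions: `pick_best` and `example_results` are hypothetical names that do not appear in the notebook, and the example dictionary holds just the CNN and RNN accuracies reported in this notebook's output rather than the full `all_results`.

```python
def pick_best(all_results):
    """Return (model_name, result) for the highest accuracy, or None if no result has one."""
    # Keep only the entries that actually carry an accuracy score.
    scored = {name: res for name, res in all_results.items() if 'accuracy' in res}
    if not scored:
        return None
    # max() returns the first maximal key, so ties resolve to the earliest
    # model in insertion order, matching the strict '>' comparison in the loop.
    best_name = max(scored, key=lambda name: scored[name]['accuracy'])
    return best_name, scored[best_name]


# Hypothetical stand-in for all_results, using the two accuracies reported in the output.
example_results = {
    'cnn': {'model': 'cnn', 'accuracy': 0.8792442497261774},
    'rnn': {'model': 'rnn', 'accuracy': 0.8406352683461117},
}

best = pick_best(example_results)
if best is not None:
    print("Best Model:", best[0])
    print("Accuracy:", best[1]['accuracy'])
```

Returning `None` for an empty or score-free dictionary keeps the "no model" case explicit, mirroring the `best_model is None` check in the original loop.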


Epoch 1/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 8s 28ms/step - accuracy: 0.4833 - loss: 1.0401 - val_accuracy: 0.7859 - val_loss: 0.6076
Epoch 2/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - accuracy: 0.8430 - loss: 0.4740 - val_accuracy: 0.8921 - val_loss: 0.3354
Epoch 3/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.9373 - loss: 0.2151 - val_accuracy: 0.8875 - val_loss: 0.3474
Epoch 4/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 5s 19ms/step - accuracy: 0.9706 - loss: 0.1079 - val_accuracy: 0.8836 - val_loss: 0.4209
Epoch 5/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - accuracy: 0.9862 - loss: 0.0568 - val_accuracy: 0.8825 - val_loss: 0.4831
Epoch 6/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - accuracy: 0.9924 - loss: 0.0286 - val_accuracy: 0.8839 - val_loss: 0.5539
Epoch 7/10
229/229

Epoch 1/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 16s 60ms/step - accuracy: 0.5169 - loss: 0.9967 - val_accuracy: 0.6985 - val_loss: 0.6976
Epoch 2/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 11s 50ms/step - accuracy: 0.8113 - loss: 0.5080 - val_accuracy: 0.8384 - val_loss: 0.4672
Epoch 3/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 10s 42ms/step - accuracy: 0.9386 - loss: 0.2043 - val_accuracy: 0.8434 - val_loss: 0.5465
Epoch 4/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 9s 41ms/step - accuracy: 0.9675 - loss: 0.1213 - val_accuracy: 0.8445 - val_loss: 0.5532
Epoch 5/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 11s 47ms/step - accuracy: 0.9777 - loss: 0.0819 - val_accuracy: 0.8486 - val_loss: 0.6218
Epoch 6/10
229/229 ━━━━━━━━━━━━━━━━━━━━ 14s 61ms/step - accuracy: 0.9825 - loss: 0.0649 - val_accuracy: 0.8442 - val_loss: 0.7575
Epoch 7/10
22

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Best Model: cnn
Accuracy: 0.8792442497261774
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       801
           1       0.89      0.88      0.89      1119
           2       0.89      0.92      0.90      1732

    accuracy                           0.88      3652
   macro avg       0.87      0.86      0.87      3652
weighted avg       0.88      0.88      0.88      3652

Model: cnn
Accuracy: 0.8792442497261774
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       801
           1       0.89      0.88      0.89      1119
           2       0.89      0.92      0.90      1732

    accuracy                           0.88      3652
   macro avg       0.87      0.86      0.87      3652
weighted avg       0.88      0.88      0.88      3652

Model: rnn
Accuracy: 0.8406352683461117
Classification Report:
              precision    recall  f1-score 