## Data loading

### Subtask:
Load the YouTube Comments dataset.


In [15]:
import pandas as pd

try:
    df = pd.read_csv('YoutubeCommentsDataSet.csv')
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'YoutubeCommentsDataSet.csv' not found. Please ensure the file exists in the current directory.")
except Exception as e:
    print(f"An error occurred: {e}")

|   | Comment | Sentiment |
|---|---------|-----------|
| 0 | lets not forget that apple pay in 2014 require... | neutral |
| 1 | here in nz 50 of retailers don’t even have con... | negative |
| 2 | i will forever acknowledge this channel with t... | positive |
| 3 | whenever i go to a place that doesn’t take app... | negative |
| 4 | apple pay is so convenient secure and easy to ... | positive |


(18408, 2)


## Data exploration

### Subtask:
Explore the loaded dataset to understand its characteristics.


**Reasoning**:
Examine data types, descriptive statistics, missing values, unique values, duplicates, and the shape of the DataFrame.



In [16]:
# Examine data types
print("Data Types:\n", df.dtypes)

# Descriptive statistics for all columns (include='all' also covers non-numeric columns)
print("\nDescriptive Statistics:\n", df.describe(include='all'))

# Check for missing values
print("\nMissing Values:\n", df.isnull().sum())

# Examine unique values and string length distribution
for col in df.columns:
    print(f"\nUnique values in '{col}':")
    print(df[col].unique())
    if df[col].dtype == 'object':  # Check for string columns
        print(f"String length distribution in '{col}':")
        print(df[col].astype(str).str.len().describe())


# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
print(f"\nNumber of duplicate rows: {len(duplicate_rows)}")
print("\nDuplicate Rows:\n", duplicate_rows)

# Print the shape of the dataframe
print("\nShape of the DataFrame:\n", df.shape)

Data Types:
 Comment      object
Sentiment    object
dtype: object

Descriptive Statistics:
                                                   Comment Sentiment
count                                               18364     18408
unique                                              17871         3
top     one of the best thing about dude is that he ne...  positive
freq                                                   10     11432

Missing Values:
 Comment      44
Sentiment     0
dtype: int64

Unique values in 'Comment':
['lets not forget that apple pay in 2014 required a brand new iphone in order to use it a significant portion of apples user base wasnt able to use it even if they wanted to as each successive iphone incorporated the technology and older iphones were replaced the number of people who could use the technology increased'
 'here in nz 50 of retailers don’t even have contactless credit card machines like paywave which support apple pay they don’t like the high fees that come

## Data cleaning

### Subtask:
Clean the data by handling missing values, removing duplicates, and addressing inconsistencies.


**Reasoning**:
Handle missing values in the 'Comment' column by replacing them with empty strings. Remove duplicate rows. Standardize the 'Sentiment' column to lowercase.



In [17]:
# Fill missing 'Comment' values with empty strings
df['Comment'].fillna('', inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Standardize 'Sentiment' to lowercase
df['Sentiment'] = df['Sentiment'].str.lower()

display(df.head())
print(df.shape)

|   | Comment | Sentiment |
|---|---------|-----------|
| 0 | lets not forget that apple pay in 2014 require... | neutral |
| 1 | here in nz 50 of retailers don’t even have con... | negative |
| 2 | i will forever acknowledge this channel with t... | positive |
| 3 | whenever i go to a place that doesn’t take app... | negative |
| 4 | apple pay is so convenient secure and easy to ... | positive |


(17877, 2)


**Reasoning**:
The previous code block triggered a chained-assignment warning because it called `fillna(..., inplace=True)` on a column selection. I will rewrite the code to assign the result back to the column and verify the data cleaning.
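
As a minimal sketch of the pattern that avoids the warning, the example below uses a small hypothetical DataFrame (`toy`, not the real dataset) and assigns the filled column back rather than modifying a column selection in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the comments dataset
toy = pd.DataFrame({
    "Comment": ["great video", np.nan, "not for me"],
    "Sentiment": ["positive", "neutral", "negative"],
})

# Assign the filled column back instead of calling
# fillna(..., inplace=True) on the column selection
toy["Comment"] = toy["Comment"].fillna("")

# Equivalent alternative: fill per column on the DataFrame itself
# toy = toy.fillna({"Comment": ""})

print(toy["Comment"].isna().sum())
```

Either form operates on the DataFrame directly, so pandas has no intermediate copy to warn about.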



In [18]:
# Fill missing 'Comment' values with empty strings
df['Comment'] = df['Comment'].fillna('')

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Standardize 'Sentiment' to lowercase
df['Sentiment'] = df['Sentiment'].str.lower()

display(df.head())
print(df.shape)

# Check for missing values again
print("\nMissing Values:\n", df.isnull().sum())

# Check for duplicates again
duplicate_rows = df[df.duplicated()]
print(f"\nNumber of duplicate rows: {len(duplicate_rows)}")

|   | Comment | Sentiment |
|---|---------|-----------|
| 0 | lets not forget that apple pay in 2014 require... | neutral |
| 1 | here in nz 50 of retailers don’t even have con... | negative |
| 2 | i will forever acknowledge this channel with t... | positive |
| 3 | whenever i go to a place that doesn’t take app... | negative |
| 4 | apple pay is so convenient secure and easy to ... | positive |


(17877, 2)

Missing Values:
 Comment      0
Sentiment    0
dtype: int64

Number of duplicate rows: 0


## Data preparation

### Subtask:
Prepare the cleaned data for further analysis or modeling.


**Reasoning**:
Prepare the cleaned data for further analysis or modeling by converting the 'Sentiment' column, which is the categorical target, to integer labels using Label Encoding, and then summarize the data preparation steps. No further feature engineering is needed at this stage.



In [19]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'Sentiment' column
df['Sentiment'] = label_encoder.fit_transform(df['Sentiment'])

# Display the first few rows of the DataFrame to show the changes
display(df.head())

# Summarize the data preparation steps
print("\nData Preparation Summary:")
print("1. Label Encoding: The 'Sentiment' column, which is categorical, was converted into numerical representations using LabelEncoder. This is necessary for most machine learning algorithms that require numerical input.")
print("    - 'negative' was encoded to 0")
print("    - 'neutral' was encoded to 1")
print("    - 'positive' was encoded to 2")


|   | Comment | Sentiment |
|---|---------|-----------|
| 0 | lets not forget that apple pay in 2014 require... | 1 |
| 1 | here in nz 50 of retailers don’t even have con... | 0 |
| 2 | i will forever acknowledge this channel with t... | 2 |
| 3 | whenever i go to a place that doesn’t take app... | 0 |
| 4 | apple pay is so convenient secure and easy to ... | 2 |



Data Preparation Summary:
1. Label Encoding: The 'Sentiment' column, which is categorical, was converted into numerical representations using LabelEncoder. This is necessary for most machine learning algorithms that require numerical input.
    - 'negative' was encoded to 0
    - 'neutral' was encoded to 1
    - 'positive' was encoded to 2
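
The mapping printed above does not have to be written by hand. The sketch below, which uses a short hypothetical label list rather than the real column, reads the mapping from the encoder's `classes_` attribute and decodes integers back to strings with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; the notebook fits the encoder on df['Sentiment'] instead
labels = ["neutral", "negative", "positive", "negative"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because `LabelEncoder` sorts its classes, the three sentiment labels map to 0, 1, and 2 in alphabetical order, which matches the summary printed by the cell.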


## Summary:

### 1. Q&A

No questions were explicitly asked in the provided text. However, the overall task implied a desire to understand and prepare the YouTube Comments dataset for analysis or modeling.

### 2. Data Analysis Key Findings

* **Missing Data:** 44 missing comments were initially present and filled with empty strings.  After cleaning, no missing values remained.
* **Duplicate Data:** 531 duplicate rows were identified and removed. After cleaning, no duplicate rows remained.
* **Sentiment Distribution:** The dataset contains comments labeled with three sentiments: 'negative', 'neutral', and 'positive'. The most frequent sentiment is 'positive', which accounts for 11,432 of the 18,408 rows (about 62%) before cleaning.
* **Comment Length:** Comment lengths vary significantly, ranging from 2 to 7,847 characters, with an average of 177 characters.
* **Label Encoding:** The 'Sentiment' column was successfully converted from categorical labels ('negative', 'neutral', 'positive') to numerical representations (0, 1, 2 respectively) using Label Encoding. Integer labels are convenient for downstream tooling, although scikit-learn classifiers also accept string labels directly.
* **Dataframe Shape After Cleaning:** The cleaned dataframe contains 17,877 rows and 2 columns.

### 3. Insights or Next Steps

* **Exploratory Data Analysis (EDA):** Conduct more in-depth EDA on the cleaned data. This could involve visualizing the distribution of sentiments, exploring relationships between comment length and sentiment, and performing word frequency analysis on the comments.
* **Model Training:** Apply machine learning models (e.g., text classification models) to the prepared data to predict sentiment based on comment text.  This could involve splitting the data into training and testing sets, and evaluating model performance.
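
The EDA ideas listed above could start from something like the following sketch. It runs on a four-row hypothetical DataFrame (`toy`), so the numbers it prints say nothing about the real dataset; the column names mirror the notebook's.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned comments DataFrame
toy = pd.DataFrame({
    "Comment": ["love it", "ok", "terrible audio quality", "so good"],
    "Sentiment": ["positive", "neutral", "negative", "positive"],
})

# Share of each sentiment class
class_share = toy["Sentiment"].value_counts(normalize=True)
print(class_share)

# Comment length in characters, averaged per sentiment
toy["length"] = toy["Comment"].str.len()
length_by_sentiment = toy.groupby("Sentiment")["length"].mean()
print(length_by_sentiment)
```

On the real data, this should be run before label encoding, or on the decoded labels, so that the groups carry readable names.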


In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [21]:
# 1. Prepare Data:
X = df['Comment']  # Features (comments)
y = df['Sentiment']  # Target (sentiment labels)

In [22]:
# 2. Split Data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [23]:
# 3. Feature Extraction (TF-IDF):
vectorizer = TfidfVectorizer(max_features=5000)  # Adjust max_features as needed
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [24]:
# 4. Train Model (Logistic Regression):
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
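
The warning above says the solver stopped at its iteration cap before converging. One common response, sketched here on a tiny hypothetical corpus rather than the notebook's data, is to raise `max_iter` above its default of 100 and then compare `n_iter_` with the cap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus; not the notebook's data
texts = ["love this", "hate this", "it is fine",
         "really love it", "really hate it", "fine i guess"]
labels = [2, 0, 1, 2, 0, 1]

features = TfidfVectorizer().fit_transform(texts)

# Raise the iteration cap above the default of 100
clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped on its own rather than hitting the cap
print(clf.n_iter_, clf.max_iter)
```

Whether 1000 iterations is enough for the full TF-IDF matrix is an assumption; if the warning persists, a larger value may be needed. Retraining with a different setting can also change the reported accuracy.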


In [25]:
# 5. Predict and Evaluate:
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7516778523489933
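
Accuracy alone can be hard to interpret when classes are imbalanced: about 62% of the raw rows are 'positive', so a model that always predicts 'positive' would already score roughly that well on similarly distributed data. The sketch below uses hypothetical label lists (not the notebook's predictions) to show how balanced accuracy and a per-class report expose such a degenerate classifier.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)

# Hypothetical labels: six 'positive' (2), two 'neutral' (1), two 'negative' (0)
y_true = [2, 2, 2, 2, 2, 2, 1, 1, 0, 0]
# A degenerate classifier that always predicts 'positive'
y_always_positive = [2] * len(y_true)

acc = accuracy_score(y_true, y_always_positive)
bal_acc = balanced_accuracy_score(y_true, y_always_positive)
print(acc, bal_acc)

# Per-class precision, recall, and F1; zero_division=0 silences the
# undefined-precision warning for classes that are never predicted
print(classification_report(y_true, y_always_positive, zero_division=0))
```

In the notebook, the same calls could be applied to `y_test` and `y_pred` to see how the model performs on the 'negative' and 'neutral' classes specifically.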


In [26]:
# Function to predict sentiment of input text:
def predict_sentiment(input_text):
    input_text_tfidf = vectorizer.transform([input_text])
    prediction = model.predict(input_text_tfidf)[0]

    # Decode the integer prediction with the fitted LabelEncoder
    # instead of hard-coding the 0/1/2 mapping
    return label_encoder.inverse_transform([prediction])[0]

In [27]:
# Input from the user:
user_input = input("Enter your text: ")

# Predict and print the sentiment:
predicted_sentiment = predict_sentiment(user_input)
print(f"Sentiment of '{user_input}': {predicted_sentiment}")

Enter your text:  hi what is your namee


Sentiment of 'hi what is your namee': neutral


In [14]:
import pickle

# Assumes `model` and `vectorizer` are the fitted objects from the training cells above

# Save the model:
with open('sentiment_model.pickle', 'wb') as f:
    pickle.dump(model, f)

# Save the vectorizer:
with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)
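
To use the saved artifacts later, both objects have to be loaded back and applied in the same order: vectorizer first, then model. The sketch below illustrates that round trip in memory with `pickle.dumps` and `pickle.loads` on a tiny stand-in model; the `toy_` names and the corpus are hypothetical. With files, the equivalent calls are `pickle.load` on handles opened in `'rb'` mode.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in the notebook these come from the training cells
texts = ["love this", "hate this", "it is fine", "really love it"]
labels = [2, 0, 1, 2]
toy_vectorizer = TfidfVectorizer().fit(texts)
toy_model = LogisticRegression(max_iter=1000).fit(
    toy_vectorizer.transform(texts), labels
)

# Serialize and restore both objects in memory
restored_vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))
restored_model = pickle.loads(pickle.dumps(toy_model))

# The restored pair should reproduce the original prediction
sample = ["love it"]
original = toy_model.predict(toy_vectorizer.transform(sample))
reloaded = restored_model.predict(restored_vectorizer.transform(sample))
print(original, reloaded)
```

Pickle files should only be loaded from trusted sources, and scikit-learn recommends loading them with the same library version that created them.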