In [100]:

import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
import numpy as np
import warnings

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [101]:


# Load and preprocess the data
df = pd.read_csv('Desktop/video_ubisoft/youtuber_impact.csv', encoding='utf-8', sep=';')

# Convert the columns to categorical sentiment values
df['sentiment_of_comments'] = df['sentiment_of_comments'].str.strip().str.lower().replace({'negative': -1, 'neutral': 0, 'positive': 1}).infer_objects(copy=False)
df['youtuber_sentiment'] = df['youtuber_sentiment'].str.strip().str.lower().replace({'negative': -1, 'neutral': 0, 'positive': 1}).infer_objects(copy=False)

# Display the first few rows to verify the transformation
print(df.head())



                                         video_title  sentiment_of_comments  \
0              The Last of Us Part II - Angry Review                     -1   
1                Is Kingdom Come Deliverance Racist?                     -1   
2                                     Concord Review                      0   
3  So far I do not recommend: Battlefield 2042 (R...                     -1   
4                                  Balan Wonderworld                     -1   

   youtuber_sentiment                                              Notes  
0                  -1  this game had so much more to live up to when ...  
1                  -1                                    fuking terrible  
2                   0                                        6 out of 10  
3                  -1        need a lot of work in design not reccomend   
4                  -1                                   this game sucks   


In [102]:
# Pearson Correlation
correlation, p_value = pearsonr(df['youtuber_sentiment'], df['sentiment_of_comments'])
print(f"Pearson correlation: {correlation}")
print(f"P-value: {p_value}")


Pearson correlation: 0.9513381762654806
P-value: 1.2172399897818896e-10


**Pearson Correlation and Linear Regression**
Pearson Correlation: The correlation coefficient is 0.951 with a p-value of 1.22e-10, indicating a very strong and statistically significant positive correlation between YouTuber sentiment and audience sentiment. Note that this calculation treats the sentiment codes (-1, 0, 1) as equally spaced numeric values.

In [104]:
# Linear Regression Model
X = np.array(df['youtuber_sentiment']).reshape(-1, 1)
y = np.array(df['sentiment_of_comments'])
linear_model = LinearRegression()
linear_model.fit(X, y)
print(f"Linear Regression Coefficient: {linear_model.coef_[0]}")
print(f"Linear Regression Intercept: {linear_model.intercept_}")
print(f"R-squared: {linear_model.score(X, y)}")

Linear Regression Coefficient: 1.0058997050147493
Linear Regression Intercept: 0.10029498525073748
R-squared: 0.9050443256201306


**Linear Regression:**
Coefficient: 1.0059, meaning that a one-unit increase in YouTuber sentiment is associated with a nearly equal increase in audience sentiment.
Intercept: 0.1003, the predicted audience sentiment when YouTuber sentiment is neutral (0); this offset is small relative to the -1 to 1 scale.
R-squared: 0.905, meaning 90.5% of the variance in audience sentiment is explained by YouTuber sentiment on the data used for fitting. With a single predictor, this is the square of the Pearson coefficient (0.951² ≈ 0.905).
Our analysis suggests a strong linear relationship between YouTuber sentiment and audience sentiment, making linear regression a reasonable initial model.
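
For a regression with a single predictor, R-squared equals the square of the Pearson coefficient, so the correlation and regression results above are two views of the same quantity. The sketch below checks this against the reported values and on a small toy dataset. The lists `x_toy` and `y_toy` and the helper names are made up for illustration and are not taken from the CSV.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_r_squared(x, y):
    """R-squared of a one-predictor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Values printed by the notebook cells above.
reported_r = 0.9513381762654806
reported_r2 = 0.9050443256201306
print(math.isclose(reported_r ** 2, reported_r2, rel_tol=1e-9))

# Hypothetical toy data on the same -1/0/1 scale (NOT the real dataset).
x_toy = [-1, -1, 0, 1, 1, 1]
y_toy = [-1, -1, 0, 1, 1, 0]
print(math.isclose(pearson_r(x_toy, y_toy) ** 2, ols_r_squared(x_toy, y_toy)))
```

Because the two numbers carry the same information, the linear regression adds the slope and intercept to the picture rather than independent evidence of fit.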

In [106]:
# Additional code for Multinomial Logistic Regression and Ordinal Logistic Regression

# Multinomial Logistic Regression Model
from sklearn.linear_model import LogisticRegression
# Adjust encoding for Logistic Regression (encode as 0 = negative, 1 = neutral, 2 = positive)
df['logistic_sentiment_of_comments'] = df['sentiment_of_comments'].replace({-1: 0, 0: 1, 1: 2})
df['logistic_youtuber_sentiment'] = df['youtuber_sentiment'].replace({-1: 0, 0: 1, 1: 2})
X_logistic = df[['logistic_youtuber_sentiment']]
y_logistic = df['logistic_sentiment_of_comments']
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(X_logistic, y_logistic)
print("Multinomial Logistic Regression Coefficients:", logistic_model.coef_)
print("Multinomial Logistic Regression Intercept:", logistic_model.intercept_)
print("Multinomial Logistic Regression Accuracy:", logistic_model.score(X_logistic, y_logistic))



Multinomial Logistic Regression Coefficients: [[-1.47435582]
 [ 0.10821212]
 [ 1.3661437 ]]
Multinomial Logistic Regression Intercept: [ 1.72467683 -1.1846857  -0.53999113]
Multinomial Logistic Regression Accuracy: 0.95


**Multinomial Logistic Regression**

Coefficients: There is one coefficient per category (negative, neutral, positive), representing the effect of YouTuber sentiment on the linear score of that category, which is then converted into a probability.
Intercepts: These values set the baseline score of each category and so shift the probability distribution across categories.
Accuracy: 95%, measured on the same data used for fitting, showing that the sentiment category in comments is predicted well from YouTuber sentiment.
Multinomial logistic regression is suitable for categorical analysis: it models the probability of each discrete sentiment category without assuming an ordered relationship.
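
To make the coefficients and intercepts concrete, the sketch below turns the printed values into class probabilities with a softmax, which is how a multinomial logistic model converts per-class scores into probabilities. The numbers are copied from the output above; the function name `class_probabilities` is an illustrative helper, not part of scikit-learn.

```python
import math

# Coefficients and intercepts copied from the notebook output above,
# in class order: 0 = negative, 1 = neutral, 2 = positive.
coefs = [-1.47435582, 0.10821212, 1.3661437]
intercepts = [1.72467683, -1.1846857, -0.53999113]
labels = ["negative", "neutral", "positive"]

def class_probabilities(x):
    """Softmax over the per-class linear scores for one encoded input x."""
    scores = [b + w * x for w, b in zip(coefs, intercepts)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Encoded YouTuber sentiment: 0 = negative, 1 = neutral, 2 = positive.
for x, name in enumerate(labels):
    probs = class_probabilities(x)
    predicted = labels[probs.index(max(probs))]
    rounded = [round(p, 3) for p in probs]
    print(f"YouTuber {name:8s} -> P(neg, neu, pos) = {rounded}, predicted: {predicted}")
```

With these coefficients, "neutral" never has the highest probability for any of the three inputs, so this fit would misclassify any video whose comments are neutral.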

In [108]:
# Ordinal Logistic Regression Model
# Note: Install mord library if not already installed
# !pip install mord
from mord import LogisticAT
X_ordinal = df[['logistic_youtuber_sentiment']].values
y_ordinal = df['logistic_sentiment_of_comments'].values
ordinal_model = LogisticAT()
ordinal_model.fit(X_ordinal, y_ordinal)
print("Ordinal Logistic Regression Coefficients:", ordinal_model.coef_)

# Predictions and Classification Report for Ordinal Logistic Regression
from sklearn.metrics import classification_report
y_ordinal_pred = ordinal_model.predict(X_ordinal)
print("Classification Report for Ordinal Logistic Regression:\n", 
      classification_report(y_ordinal, y_ordinal_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0))

Ordinal Logistic Regression Coefficients: [2.63752154]
Classification Report for Ordinal Logistic Regression:
               precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         9
     Neutral       0.00      0.00      0.00         1
    Positive       0.91      1.00      0.95        10

    accuracy                           0.95        20
   macro avg       0.64      0.67      0.65        20
weighted avg       0.90      0.95      0.93        20



**Ordinal Logistic Regression**

Coefficient: 2.6375, indicating a positive association: more positive YouTuber sentiment raises the likelihood of more positive audience sentiment.
Classification Report:
Accuracy: 95%, the same as the multinomial logistic regression model.
Precision and Recall: Both are high for the "Negative" and "Positive" categories but zero for "Neutral". The dataset contains only one neutral instance and the model never predicts that category, so the class imbalance is the likely cause.
Ordinal logistic regression is appropriate when sentiment is treated as an ordered variable (negative < neutral < positive), which fits the assumption that audience sentiment follows a progression from negative to positive.
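
The macro and weighted averages in the report can be recomputed by hand, which shows why a single missed neutral item affects them so differently. This is a sketch on labels reconstructed from the report, not on the original rows: the support counts are 9, 1, and 10, and the neutral item is assumed to be predicted as positive, which is what the 0.91 (10/11) positive precision implies.

```python
# Labels reconstructed from the classification report above
# (0 = negative, 1 = neutral, 2 = positive). Assumption: the one
# neutral item is predicted as positive, matching precision 10/11.
y_true = [0] * 9 + [1] + [2] * 10
y_pred = [0] * 9 + [2] + [2] * 10

def precision_recall(label):
    """Per-class precision and recall, returning 0.0 when undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

classes = [0, 1, 2]
support = [y_true.count(c) for c in classes]
precisions = [precision_recall(c)[0] for c in classes]
recalls = [precision_recall(c)[1] for c in classes]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Macro average: every class counts equally, regardless of size.
macro_precision = sum(precisions) / len(classes)
# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(p * s for p, s in zip(precisions, support)) / sum(support)

print(f"accuracy: {accuracy:.2f}")
print(f"macro precision: {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```

The macro average gives the one-item neutral class the same weight as the other two, so it drops to 0.64, while the support-weighted average stays at 0.90.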

# **Which Analysis is Most Relevant and Why?**
The Pearson correlation and linear regression give a straightforward interpretation of the linear relationship between YouTuber and audience sentiment. However, ordinal logistic regression is arguably the most relevant model here, because it treats sentiment as an ordered variable (negative < neutral < positive) rather than as a number or an unordered label. This model leverages the ordinal nature of the data, capturing the likelihood that more positive YouTuber sentiment is accompanied by more positive audience sentiment. The high accuracy of both logistic models (95% each) further supports the robustness of this approach.

# **Why We Added Additional Analyses (Logistic Regression and Ordinal Regression)**
While the Pearson correlation and linear regression provided strong evidence of a linear relationship between YouTuber sentiment and audience sentiment, these models treated sentiment as a continuous variable. To fully capture the nature of sentiment data, it’s essential to consider it as categorical (positive, neutral, negative) rather than continuous. Logistic regression (multinomial or ordinal) models the probability of each discrete sentiment category in the audience based on the YouTuber’s sentiment, allowing for a more nuanced analysis of the relationship.

# **CONCLUSION**

The results from all models (correlation and linear regression, multinomial logistic regression, and ordinal logistic regression) consistently suggest a strong relationship between YouTuber sentiment and audience sentiment. **This alignment indicates that the sentiment expressed by the YouTuber is a major explanatory factor for the sentiment seen in the comments, supporting the hypothesis that the YouTuber’s opinion significantly shapes or influences audience reactions.**