# Sentiment Analysis Using Python

A step-by-step guide to performing sentiment analysis.

## Introduction to Sentiment Analysis

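*Explanation of sentiment analysis and its applications. The code in this notebook applies a bag-of-words text-classification pipeline to movie plot summaries, using genre as the label.*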

In [None]:
# Install necessary libraries
# !pip install numpy pandas scikit-learn nltk matplotlib

## Setting Up the Environment

*Importing necessary libraries.*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import nltk

## Data Collection and Preprocessing

*Loading the Wikipedia movie plots dataset and filtering it to single-genre plots of 100 to 500 words.*

In [None]:
# Load the public Wikipedia movie plots dataset
# https://www.kaggle.com/jrobischon/wikipedia-movie-plots
wiki_data = pd.read_csv('wiki_movie_plots_deduped.csv')
wiki_data = wiki_data[['Plot', 'Genre']]
# Remove rows with missing values
wiki_data = wiki_data.dropna()
# Remove rows with multiple (comma-separated) genres
wiki_data = wiki_data[~wiki_data['Genre'].str.contains(',')]
# Keep only rows whose genre is one of the 10 genres listed here
top_10_genres = ['drama', 'comedy', 'horror', 'action', 'thriller', 'romance', 'western', 'crime', 'adventure', 'musical']
# .copy() avoids a SettingWithCopyWarning when Plot_Length is added below
wiki_data = wiki_data[wiki_data['Genre'].isin(top_10_genres)].copy()
# Count the words in each plot
wiki_data['Plot_Length'] = wiki_data['Plot'].str.split().str.len()
# Keep plots between 100 and 500 words (both bounds inclusive)
wiki_data = wiki_data[wiki_data['Plot_Length'].between(100, 500)]

# Split the data into training and testing sets
train_data, test_data = train_test_split(wiki_data, test_size=0.2, random_state=42)
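
To sanity-check the filtering rules without downloading the CSV, the same conditions can be applied to a few hand-written rows. This is a plain-Python sketch: `keep_row`, `ALLOWED_GENRES`, and the sample rows are hypothetical and are not part of the notebook's pandas pipeline.

```python
# Hypothetical helper mirroring the notebook's filters on plain (plot, genre) pairs.
ALLOWED_GENRES = {'drama', 'comedy', 'horror', 'action', 'thriller',
                  'romance', 'western', 'crime', 'adventure', 'musical'}

def keep_row(plot, genre, min_words=100, max_words=500):
    """Return True if a row would survive the filters described above."""
    if plot is None or genre is None:   # corresponds to dropna()
        return False
    if ',' in genre:                    # multiple genres
        return False
    if genre not in ALLOWED_GENRES:     # genre outside the fixed list
        return False
    n_words = len(plot.split())         # whitespace split, like str.split()
    return min_words <= n_words <= max_words

# Made-up rows; word counts are controlled by repeating a token.
rows = [
    ('word ' * 150, 'drama'),          # kept
    ('word ' * 50, 'comedy'),          # too short
    ('word ' * 600, 'horror'),         # too long
    ('word ' * 150, 'comedy, drama'),  # multiple genres
    ('word ' * 150, 'sci-fi'),         # genre not in the list
    (None, 'drama'),                   # missing plot
]
kept = [genre for plot, genre in rows if keep_row(plot, genre)]
print(kept)
```

Only the first row passes every rule, which makes it easy to see which condition rejects each of the others.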



## Exploratory Data Analysis

*Visualizing and understanding the data.*

In [None]:
# Plot the distribution of genres
plt.figure(figsize=(10, 5))
wiki_data['Genre'].value_counts().plot(kind='bar')
plt.title('Distribution of Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()

# Plot the distribution of plot lengths
plt.figure(figsize=(10, 5))
wiki_data['Plot_Length'].hist()
plt.title('Distribution of Plot Lengths')
plt.xlabel('Plot Length (words)')
plt.ylabel('Count')
plt.show()
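
The two charts can also be summarized as numbers, which is handy when comparing runs. The sketch below uses only the standard library and made-up `(genre, plot_length)` pairs in place of `wiki_data`; the `sample` values are assumptions for illustration, not figures from the dataset.

```python
from collections import Counter
from statistics import median

# Made-up (genre, plot_length) pairs standing in for rows of wiki_data.
sample = [('drama', 120), ('drama', 340), ('comedy', 210),
          ('drama', 480), ('horror', 150), ('comedy', 305)]

genre_counts = Counter(genre for genre, _ in sample)
total = sum(genre_counts.values())
# Share of each genre, largest first (a text analogue of the bar chart).
genre_share = {g: c / total for g, c in genre_counts.most_common()}
# Median plot length (a one-number summary of the histogram).
median_length = median(length for _, length in sample)

print(genre_share)
print(median_length)
```

A large share for a single genre is worth noting before training, because a classifier can reach that share in accuracy just by always predicting the most common genre.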


## Training a Sentiment Analysis Model

*Converting the plot text into bag-of-words features for training and testing.*

In [None]:
# Import the vectorizer, classifier, and metric used in this and later cells
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the bag-of-words vectorizer, dropping English stop words
vectorizer = CountVectorizer(stop_words='english')
# Learn the vocabulary from the training plots only, then transform them
train_x = vectorizer.fit_transform(train_data['Plot'])
# Transform the testing plots with the same vocabulary
test_x = vectorizer.transform(test_data['Plot'])
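
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the idea. It is a simplification, not scikit-learn's implementation: `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helpers, tokenization is a plain whitespace split, and no stop words are removed.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (much simpler than CountVectorizer)."""
    return text.lower().split()

def fit_vocabulary(docs):
    """Map each distinct training token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn each document into a row of token counts; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train_docs = ['the sheriff rides west', 'the ghost haunts the house']
test_docs = ['the ghost rides a dragon']

vocab = fit_vocabulary(train_docs)      # learned from the training text only
train_x = transform(train_docs, vocab)
test_x = transform(test_docs, vocab)    # 'a' and 'dragon' are not in vocab

print(list(vocab))
print(test_x)
```

The vocabulary is built from the training documents alone, so words that appear only in the test text contribute nothing to its feature row. Fitting on the training split and only transforming the test split keeps the test data from influencing the features.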


## Model Evaluation

*Fitting a logistic regression classifier and measuring its accuracy on the test set.*

In [None]:
# Create the logistic regression model
# (max_iter is raised from the default of 100, which may not be enough to converge on these features)
model = LogisticRegression(max_iter=1000)
# Fit the model to the training data
model.fit(train_x, train_data['Genre'])
# Predict the genre of the testing data
predictions = model.predict(test_x)
# Calculate the accuracy of the model
accuracy = accuracy_score(test_data['Genre'], predictions)
print(f'Model Accuracy: {accuracy:.4f}')
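
An accuracy figure is easier to interpret next to a baseline. The sketch below computes accuracy by hand and compares it with always predicting the most frequent training label. The `accuracy` helper and all labels are made up for illustration and do not come from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for illustration only.
train_labels = ['drama', 'drama', 'comedy', 'drama', 'horror']
y_true = ['drama', 'comedy', 'drama', 'horror']
y_pred = ['drama', 'drama', 'drama', 'horror']

# Baseline: always predict the most frequent training label.
majority_label = Counter(train_labels).most_common(1)[0][0]
baseline_pred = [majority_label] * len(y_true)

model_acc = accuracy(y_true, y_pred)
baseline_acc = accuracy(y_true, baseline_pred)
print(f'model: {model_acc:.2f}, baseline: {baseline_acc:.2f}')
```

If the model's accuracy is close to the baseline, the classifier has learned little beyond the class frequencies.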


## Model Improvement

*Improving the model by adding bigram features to the bag of words.*

In [None]:
# Create a bag-of-words vectorizer that counts single words and two-word phrases
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
# Learn the vocabulary from the training plots only, then transform them
train_x = vectorizer.fit_transform(train_data['Plot'])
# Transform the testing plots with the same vocabulary
test_x = vectorizer.transform(test_data['Plot'])
# Create the logistic regression model
# (max_iter is raised from the default of 100, which may not be enough to converge on these features)
model = LogisticRegression(max_iter=1000)
# Fit the model to the training data
model.fit(train_x, train_data['Genre'])
# Predict the genre of the testing data
predictions = model.predict(test_x)
# Calculate the accuracy of the model
accuracy = accuracy_score(test_data['Genre'], predictions)
print(f'Model Accuracy: {accuracy:.4f}')
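
The `ngram_range=(1, 2)` setting adds two-word phrases to the single-word features. A minimal sketch of what that range produces for one tokenized sentence follows. The `ngrams` helper is hypothetical, and it omits the tokenization and stop-word handling that `CountVectorizer` performs.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = 'haunted house on the hill'.split()
features = ngrams(tokens)        # unigrams followed by bigrams
bigrams_only = ngrams(tokens, 2, 2)

print(features)
```

Five tokens yield five unigrams and four bigrams, so the feature space grows quickly with the upper bound of the range. That growth is the usual trade-off against any accuracy gain.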


## Visualizing Results

*Visualizing the classification results.*

In [None]:
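# Create a confusion matrix with rows and columns in the order of top_10_genres
# (without labels=, scikit-learn sorts the labels, which would not match the axis names)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_data['Genre'], predictions, labels=top_10_genres)
# Create a dataframe from the confusion matrix
confusion_matrix_df = pd.DataFrame(cm, index=top_10_genres, columns=top_10_genres)
# Plot the confusion matrix with genre names on both axes
plt.figure(figsize=(10, 5))
plt.imshow(confusion_matrix_df, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(top_10_genres)), top_10_genres, rotation=45, ha='right')
plt.yticks(range(len(top_10_genres)), top_10_genres)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Genre')
plt.ylabel('Actual Genre')
plt.tight_layout()
plt.show()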
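
The label order matters when reading a confusion matrix: scikit-learn's `confusion_matrix` orders rows and columns by sorted label unless `labels` is passed, so the axis names must follow the same order as the counts. The pure-Python sketch below builds the counts with an explicit order. `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels, both in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

labels = ['drama', 'comedy', 'horror']   # deliberately not alphabetical
y_true = ['drama', 'drama', 'comedy', 'horror', 'comedy']
y_pred = ['drama', 'comedy', 'comedy', 'horror', 'drama']

matrix = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f'{label:>7}: {row}')
```

The diagonal holds the correct predictions, and each off-diagonal cell shows which actual genre was mistaken for which predicted genre.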


## Saving and Deploying the Model

*How to save and deploy the model.*

In [None]:
# Save the model and the vectorizer
# (the with blocks close each file even if pickling fails)
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Load the model and the vectorizer for deployment
# (only unpickle files from a source you trust)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

# Create a function to predict the genre of a movie plot
def predict_genre(plot):
    # Transform the plot using the vectorizer
    plot = vectorizer.transform([plot])
    # Predict the genre using the model
    genre = model.predict(plot)[0]
    # Return the genre
    return genre
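
A quick round-trip check confirms that a saved object predicts the same way after loading. The sketch below uses a toy keyword lookup as a stand-in for the fitted model, so it runs without scikit-learn. `toy_model`, `predict_with`, and the file name are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: a keyword -> genre lookup (hypothetical).
toy_model = {'sheriff': 'western', 'ghost': 'horror', 'wedding': 'romance'}

def predict_with(model, plot, default='drama'):
    """Return the genre of the first known keyword in the plot, else `default`."""
    for word in plot.lower().split():
        if word in model:
            return model[word]
    return default

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    # The with blocks close the file even if dump or load raises.
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

plot = 'A ghost follows the new tenants'
same_prediction = predict_with(toy_model, plot) == predict_with(restored, plot)
print(predict_with(restored, plot), same_prediction)
```

The same pattern applies to the real model and vectorizer: predict on a few plots before saving, reload, and compare the outputs.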

## Conclusion

*Summary and discussion of the project.*





## References

*List of references and further reading.*