# Text Analysis Assignment

## Objective
The goal of this assignment is to assess your ability to perform text analysis using Python libraries, focusing on regular expressions and other text processing techniques.

## Dataset
For this assignment, we will use the SMS Spam Collection dataset from the UCI Machine Learning Repository, which is downloaded in Step 2. Each message is labeled spam or ham.

Note: This lab will not work on older versions of Python; make sure to work in Google Colab.

**This assignment should be completed and submitted by 11:59 PM PST on Friday, Oct 25th, 2024**

## Step 1: Environment Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import zipfile
import requests

# Download NLTK resources if needed
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Step 2: Load the Dataset from URL

*   First, we need to download and extract the dataset.

In [2]:
# Download and extract the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed

# Save the zip file locally
with open('smsspamcollection.zip', 'wb') as file:
    file.write(response.content)

# Extract the zip file
with zipfile.ZipFile('smsspamcollection.zip', 'r') as zip_ref:
    zip_ref.extractall()

# Load the dataset into a DataFrame (the file name is 'SMSSpamCollection')
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

# Preview the data
print(df.head())


## Question 1: Data Overview
Display the shape of the dataset and check for any missing values.

In [1]:
# Display the shape of the dataset and check for any missing values
print(f"Shape of the dataset: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")



## Question 2: Data Cleaning
Write a function to clean the text data by removing URLs and converting text to lowercase.

In [None]:
# Function to clean the text data
def clean_text(text):
    # Remove URLs ('http\S+' already covers https links, so no separate alternative is needed)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply the cleaning function to the 'message' column
df['cleaned_message'] = df['message'].apply(clean_text)

# Preview cleaned data
print(df[['message', 'cleaned_message']].head())
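
To see what the cleaning step does to a single string, here is a minimal sketch. The sample message is hypothetical (not taken from the dataset), the pattern is a simplified equivalent of the one in clean_text, and the final whitespace-collapsing step is an optional extra that clean_text does not perform.

```python
import re

# Hypothetical sample message, not taken from the dataset
sample = "WINNER! Claim your prize at http://example.com/win or www.example.org NOW"

# Same two steps as clean_text: strip URLs, then lowercase
no_urls = re.sub(r'http\S+|www\S+', '', sample)
lowered = no_urls.lower()

# Optional extra step: removing a URL leaves the spaces around it behind,
# so collapse runs of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', lowered).strip()

print(repr(lowered))
print(repr(cleaned))
```

Printing with repr makes the leftover double spaces visible before they are collapsed.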

## Question 3: Tokenization
Tokenize the cleaned text into words and display the first five tokens from one entry.

In [None]:
# Tokenize the cleaned text into words
df['tokenized_message'] = df['cleaned_message'].apply(word_tokenize)

# Display the first five tokens from one entry
print(df['tokenized_message'].iloc[0][:5])


## Question 4: Removing Stop Words
Remove common stop words from the tokenized words and display the remaining words for one entry.

In [None]:
# Remove stop words from tokenized words
stop_words = set(stopwords.words('english'))

df['no_stopwords_message'] = df['tokenized_message'].apply(lambda x: [word for word in x if word not in stop_words])

# Display remaining words for one entry
print(df['no_stopwords_message'].iloc[0])
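
Because word_tokenize emits punctuation as separate tokens and the English stop-word list contains only words, tokens such as ',' and '...' are likely to survive this step. The sketch below shows one possible follow-up filter using str.isalpha; the token list is hypothetical and only shaped like one row of no_stopwords_message.

```python
# Hypothetical token list, shaped like one row of df['no_stopwords_message']
tokens = ['free', 'entry', '!', 'text', 'win', '2nd', '...', ',', 'prize']

# Keep only purely alphabetic tokens. This drops '!', '...', and ',',
# but it also drops tokens containing digits such as '2nd',
# which may or may not be what the analysis needs.
alpha_tokens = [tok for tok in tokens if tok.isalpha()]

print(alpha_tokens)
```

Whether to apply such a filter is a design choice: it makes word counts cleaner, but digits and symbols can themselves be informative in spam detection.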


## Question 5: Word Frequency Distribution
1. Create a frequency distribution of words in the cleaned text and plot it using Matplotlib.
2. Focus on displaying only the top 10 most common words.

In [None]:
# Flatten the list of words for the entire dataset
all_words = [word for tokens in df['no_stopwords_message'] for word in tokens]

# Create frequency distribution
freq_dist = nltk.FreqDist(all_words)

# Plot the top 10 most common words
common_words = freq_dist.most_common(10)
words, counts = zip(*common_words)

plt.figure(figsize=(10, 5))
plt.bar(words, counts)
plt.title('Top 10 Most Common Words')
plt.xlabel('Word')
plt.ylabel('Count')
plt.show()
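
nltk.FreqDist is built on collections.Counter, so the counting step above can be illustrated with the standard library alone. This is a minimal sketch on a hypothetical word list standing in for all_words; it shows how most_common and zip(*...) produce the two sequences passed to plt.bar.

```python
from collections import Counter

# Hypothetical flattened word list standing in for all_words
sample_words = ['call', 'free', 'call', 'now', 'free', 'call', 'ok']

word_counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = word_counts.most_common(2)

# zip(*pairs) splits the pairs into one tuple of words and one of counts
top_words, top_counts = zip(*top_two)

print(top_two)
print(top_words, top_counts)
```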


## Question 6: Label Distribution
1. Analyze the distribution of spam and ham messages in the dataset. Create a bar plot to visualize the counts of each label.

In [None]:
# Analyze the distribution of spam and ham messages
label_counts = df['label'].value_counts()

# Create a bar plot to visualize the counts of each label
plt.figure(figsize=(7, 5))
label_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Distribution of Spam vs Ham Messages')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()


## Question 7: Length of Messages
1. Calculate the length of each message (in terms of character count) and add this as a new column to the DataFrame. Then, visualize the distribution of message lengths using a histogram.

In [None]:
# Calculate the length of each message
df['message_length'] = df['message'].apply(len)

# Visualize the distribution of message lengths using a histogram
plt.figure(figsize=(10, 5))
plt.hist(df['message_length'], bins=20, color='green')
plt.title('Distribution of Message Lengths')
plt.xlabel('Message Length (Characters)')
plt.ylabel('Frequency')
plt.show()


## Question 8: Most Common Words in Spam vs. Ham
1. Identify and display the most common words in spam messages compared to ham messages. Use a simple frequency count for this analysis.

In [None]:
# Separate spam and ham messages
spam_words = [word for tokens in df[df['label'] == 'spam']['no_stopwords_message'] for word in tokens]
ham_words = [word for tokens in df[df['label'] == 'ham']['no_stopwords_message'] for word in tokens]

# Create frequency distributions for spam and ham words
spam_freq_dist = nltk.FreqDist(spam_words)
ham_freq_dist = nltk.FreqDist(ham_words)

# Display the most common words in spam and ham messages
print("Most common words in spam messages:", spam_freq_dist.most_common(5))
print("Most common words in ham messages:", ham_freq_dist.most_common(5))
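
Raw counts are hard to compare when one class contains many more words than the other. One possible refinement, sketched below on hypothetical word lists standing in for spam_words and ham_words, is to divide each count by the total number of words in its class. The helper relative_freq is an illustrative name, not part of NLTK.

```python
from collections import Counter

# Hypothetical word lists standing in for spam_words and ham_words
sample_spam = ['free', 'call', 'prize', 'call']
sample_ham = ['ok', 'call', 'later', 'ok', 'home', 'call', 'ok', 'see']

def relative_freq(words):
    """Map each word to its share of all words in the list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

spam_rel = relative_freq(sample_spam)
ham_rel = relative_freq(sample_ham)

# 'call' occurs twice in both lists, but it makes up a larger share of the spam words
print(spam_rel['call'], ham_rel['call'])
```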


## Question 9: Word Cloud Visualization
1. Create a word cloud for the cleaned text of all messages. This visualization will help you see which words are most prominent in the dataset.

In [None]:
from wordcloud import WordCloud

# Combine all the cleaned messages into one large string
all_text = ' '.join(df['cleaned_message'])

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## Question 10: Sentiment Analysis (Basic)
1. Perform a basic sentiment analysis by checking for specific keywords (e.g., "free", "win", "call", etc.) in spam messages.
2. Count how many spam messages contain at least one of these keywords.

In [None]:
# Define spam-related keywords
keywords = ["free", "win", "call", "prize", "money", "urgent"]

# Count messages that contain at least one of these keywords
# (cleaned_message is already lowercase, so no extra .lower() is needed)
spam_messages = df[df['label'] == 'spam']['cleaned_message']
spam_messages_containing_keywords = [
    msg for msg in spam_messages
    if any(keyword in msg for keyword in keywords)
]
spam_keyword_count = len(spam_messages_containing_keywords)

print(f"Number of spam messages containing keywords: {spam_keyword_count}")
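
The check above is a substring test, so "win" also matches "winter" and "call" also matches "recalled". If whole-word matching is preferred, one option is a regular expression with \b word boundaries. The sketch below compares the two approaches on a few hypothetical lowercase messages; it is an alternative to consider, not the method the assignment requires.

```python
import re

keywords = ["free", "win", "call", "prize", "money", "urgent"]

# \b anchors each keyword at word boundaries, so 'win' does not match 'winter'
keyword_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

# Hypothetical lowercase messages, not taken from the dataset
messages = [
    "you win a free prize",
    "see you this winter",
    "i recalled the order",
    "urgent: call now",
]

# Substring test, as in the cell above
substring_hits = [m for m in messages if any(k in m for k in keywords)]

# Whole-word test using the compiled pattern
whole_word_hits = [m for m in messages if keyword_pattern.search(m)]

print(len(substring_hits), len(whole_word_hits))
```

The two counts differ only by the messages in which a keyword appears inside a longer word.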


## Reflection on Findings
1. Reflect on what you learned from this assignment.
2. Discuss any insights gained from analyzing spam versus ham messages, including patterns you observed or challenges you faced during your analysis.

In [None]:
From this assignment, I have learned the importance of preprocessing steps such as cleaning, tokenization, and stopword removal in text analysis. By analyzing the data, I found that spam messages often contain certain keywords such as "free", "win", and "prize", which can be used for simple sentiment analysis. 

I also observed that spam messages tend to be more promotional and action-oriented, while ham messages are more conversational. One of the challenges was dealing with noisy data, including special characters and varying formats, but regular expressions and careful text cleaning made the analysis smoother.


## Submission
1. Make sure you have run all cells in your notebook in order before submitting, so that all images/graphs appear in the output.
2. Name your notebook with your roll number (e.g., 2022cs_01_assignment2).
3. Upload your notebook and make sure to turn it in before the deadline.