# Week 12 Lab Assignment: Predictive Analytics with Text, Sentiment Mining

### Objective
In this lab, you will explore text mining techniques, perform sentiment analysis, and apply topic modeling and named entity recognition to textual data. The lab focuses on text preprocessing, using text representation methods, and extracting insights from text data.

### 1. Setup and Installations
**Objective:** Ensure all necessary packages are installed and imported for the lab.

**Tasks:**
1. Install required Python packages: Scikit-learn, Pandas, Numpy, Matplotlib, Seaborn, NLTK, and Gensim.

In [1]:
# Install necessary packages
%pip install scikit-learn pandas numpy matplotlib seaborn nltk gensim

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### 2. Import Libraries
**Objective:** Import all necessary libraries for data manipulation, text processing, modeling, and visualization.


In [2]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jason\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jason\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jason\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 3. Load and Explore Text Data
**Objective:** Gain a preliminary understanding of the text dataset to be used for analysis.

**Tasks:**
1. **Load the Dataset:** Import the dataset into a Pandas DataFrame.
2. **Inspect the Data:** Use Pandas functions to inspect the first few rows, check for missing values, and understand the data types.

In [3]:
# Load the dataset
df = pd.read_csv('text_data.csv')

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

                                                text     label
0  The product quality is outstanding and I'm ver...  positive
1  I had a terrible experience with customer serv...  negative
2  Delivery was fast, and the items were well-pac...  positive
3  The website is difficult to navigate and confu...  negative
4  I'm impressed with the variety of products ava...  positive
text     0
label    0
dtype: int64


### 4. Text Preprocessing
**Objective:** Prepare the text data by cleaning and tokenizing it.

**Tasks:**
1. **Tokenization:** Split text into individual words or tokens.
2. **Stop Words Removal:** Use NLTK to remove common stop words from the text.
3. **Lemmatization:** Reduce words to their root form using WordNetLemmatizer.

In [4]:
# Text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

print(df.columns)

def preprocess_text(text):
    # Check if the input is a valid string
    if isinstance(text, str):
        # Tokenize the text
        tokens = word_tokenize(text.lower())  
        # Lemmatize and remove stopwords
        tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and t.isalpha()]  
        return ' '.join(tokens)  # Join the tokens back into a string
    else:
        return ''  # Return an empty string for non-string inputs

# Check if 'text' column has non-string values
print(df['text'].apply(lambda x: type(x)).value_counts())

# Fill missing values with an empty string
df['text'] = df['text'].fillna('')

print(df[df['text'].isnull()])

# # Apply the preprocessing
# df['cleaned_text'] = df['text'].apply(preprocess_text)
# Apply the preprocessing using a list comprehension
# Initialize an empty list to store the cleaned text
cleaned_texts = []

# Iterate over each text in the 'text' column
for text in df['text']:
    # Preprocess the text
    cleaned_text = preprocess_text(text)
    # Append the cleaned text to the list
    cleaned_texts.append(cleaned_text)

# Assign the cleaned texts to a new column in the dataframe
df['cleaned_text'] = cleaned_texts
print(df['cleaned_text'].head())

Index(['text', 'label'], dtype='object')
text
<class 'str'>    20
Name: count, dtype: int64
Empty DataFrame
Columns: [text, label]
Index: []


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\jason/nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\share\\nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\lib\\nltk_data'
    - 'C:\\Users\\jason\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


### 5. Sentiment Analysis
**Objective:** Build and evaluate a model to classify text based on sentiment (positive, negative).

**Tasks:**
1. **Vectorization:** Convert text into numerical features using TF-IDF.
2. **Model Training:** Train a Logistic Regression model on the text data.
3. **Evaluation:** Evaluate the model using accuracy and classification report.

In [13]:
# Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['cleaned_text']).toarray()
y = df['label']  # Assuming 'label' is the target column

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape}')
print(f'Test set size: {X_test.shape}')

# Train a Logistic Regression model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

# Evaluate the model
print('Logistic Regression Classification Report')
print(classification_report(y_test, lr_pred))

KeyError: 'cleaned_text'

### 6. Topic Modeling with LDA
**Objective:** Identify topics within a set of documents using Latent Dirichlet Allocation (LDA).

**Tasks:**
1. **Prepare Text Data:** Tokenize and remove stop words.
2. **Create Dictionary and Corpus:** Use Gensim to create a dictionary and corpus for LDA.
3. **Train LDA Model:** Use LDA to find topics within the text data.

In [14]:
# Tokenization and stop words removal for LDA
texts = [text.split() for text in df['cleaned_text']]

# Create Dictionary and Corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA Model
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

# Display Topics
topics = lda_model.print_topics(num_words=4)
for topic in topics:
    print(topic)

KeyError: 'cleaned_text'

### 7. Named Entity Recognition (NER)
**Objective:** Extract named entities (e.g., names, dates, locations) from text.

**Tasks:**
1. **Use NLTK for NER:** Tokenize text and apply NER using NLTK.

In [15]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Example text for NER
example_text = "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."

# Tokenize and perform POS tagging
tokens = nltk.word_tokenize(example_text)
pos_tags = nltk.pos_tag(tokens)

# Perform Named Entity Recognition
named_entities = nltk.ne_chunk(pos_tags)
print(named_entities)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\jason\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\jason\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\jason/nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\share\\nltk_data'
    - 'c:\\Program Files (x86)\\Microsoft Visual Studio\\Shared\\Python39_64\\lib\\nltk_data'
    - 'C:\\Users\\jason\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


### 8. Summary and Discussion
**Objective:** Reflect on the use of text mining techniques and discuss their implications in the context of business applications.

**Tasks:**
1. **Compare Techniques:** Discuss the results from sentiment analysis, topic modeling, and named entity recognition.
2. **Business Implications:** Describe how text mining can provide valuable insights for businesses, such as understanding customer feedback and monitoring social media.

### 9. Submission
**Deliverables:**
- Jupyter Notebook (.ipynb) with all code, visualizations, and analysis.
- A brief report (1-2 paragraphs) summarizing the findings, including sentiment analysis results, topic modeling insights, and NER outcomes.

**Deadline:** Submit your completed notebook and report to the course portal by the end of class.