
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.


# **Question 1: Topic Modeling**


(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA and LSA. The following information should be reported:

(1) Features (top n-gram phrases) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.
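
As an illustration of item (1), the sketch below counts n-gram phrases with the Python standard library only. It is a minimal example under stated assumptions: the helper `top_ngrams` and the toy documents are hypothetical and are not part of the assignment or of gensim, which the notebook itself uses for phrase detection.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(docs: List[List[str]], n: int = 2,
               k: int = 3) -> List[Tuple[str, int]]:
    """Count n-grams across tokenized documents and return the k most frequent."""
    counts: Counter = Counter()
    for tokens in docs:
        # Slide a window of length n over each document's tokens
        for i in range(len(tokens) - n + 1):
            counts['_'.join(tokens[i:i + n])] += 1
    return counts.most_common(k)


# Hypothetical toy documents, already tokenized and lowercased
toy_docs = [
    ['great', 'visual', 'effects', 'and', 'great', 'acting'],
    ['visual', 'effects', 'were', 'weak'],
    ['great', 'acting', 'weak', 'plot'],
]
top_bigrams = top_ngrams(toy_docs, n=2, k=2)
print(top_bigrams)
```

Joining tokens with an underscore mirrors the way merged phrases such as `legend_ten` appear in the topic output, so the counted phrases can be compared directly with the model's vocabulary.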


In [1]:
 !pip install -q gensim
 !pip install -q pyLDAvis



In [2]:
import warnings
from pprint import pprint
from typing import List, Tuple, Union

import gensim
import matplotlib.pyplot as plt
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models
import spacy
import nltk
from nltk.corpus import stopwords

warnings.filterwarnings('ignore', category=DeprecationWarning)

# Prepare matplotlib
plt.style.use('ggplot')

# Prepare spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Prepare a list of stopwords for preprocessing step
nltk.download('stopwords')
stop_words = stopwords.words('english')


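def word_tokenize(sentences: List[str]) -> List[List[str]]:
    # Lowercase, strip accents (deacc=True), and split each review into tokens
    return [gensim.utils.simple_preprocess(s, deacc=True) for s in sentences]


def to_bigrams(sentences: List[List[str]],
               bigram_model: gensim.models.phrases.Phraser) -> List[List[str]]:
    # Merge frequently co-occurring token pairs into single tokens
    return [bigram_model[s] for s in sentences]


def to_trigrams(sentences: List[List[str]],
                bigram_model: gensim.models.phrases.Phraser,
                trigram_model: gensim.models.phrases.Phraser) -> List[List[str]]:
    # Apply the bigram model first, then merge again to form trigrams
    return [trigram_model[bigram_model[s]] for s in sentences]


def lemmatize(sentences: List[List[str]],
              pos_tags: Tuple[str, ...] = ('NOUN', 'ADJ', 'VERB', 'ADV')
              ) -> List[List[str]]:
    # Keep only the lemmas whose part-of-speech tag is listed in pos_tags
    lemmatized = []
    for sentence in sentences:
        doc = nlp(' '.join(sentence))
        lemmatized.append(
            [token.lemma_ for token in doc if token.pos_ in pos_tags])

    return lemmatized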


def visualize_topics(model: Union[gensim.models.LdaModel,
                                  gensim.models.LsiModel],
                     id2word: gensim.corpora.Dictionary,
                     corpus: List[Tuple[int, int]]):
    # Visualize the topics
    pyLDAvis.enable_notebook()
    topics_vis = pyLDAvis.gensim_models.prepare(model, corpus, id2word)
    return topics_vis



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Load the dataset
df = pd.read_csv('./cleaned_reviews_github.csv')

# Remove stopwords from the text
df['review'] = df['review'].apply(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words]))

# Tokenize the sentences
data_words = word_tokenize(df['review'])

# Build bigram & trigram models
bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words],
                                        threshold=100)

bigram_model = gensim.models.phrases.Phraser(bigram_phrases)
trigram_model = gensim.models.phrases.Phraser(trigram_phrases)

# Transform sentences into bigrams
data_bigrams = to_bigrams(data_words, bigram_model)

# Lemmatize the tokens and keep only nouns, adj, verb, and adv
data_lemmatized = lemmatize(data_bigrams)

# Create corpus
id2word = gensim.corpora.Dictionary(data_lemmatized)
corpus = [id2word.doc2bow(x) for x in data_lemmatized]

# Create model
model = gensim.models.LdaModel(corpus=corpus,
                               num_topics=10,
                               id2word=id2word,
                               random_state=100,
                               update_every=1,
                               chunksize=100,
                               passes=10,
                               alpha='auto',
                               per_word_topics=True)

pprint(model.print_topics())



[(0,
  '0.022*"step" + 0.021*"aesthetic" + 0.017*"budget" + 0.017*"sadly" + '
  '0.017*"brand" + 0.015*"sign" + 0.014*"rarely" + 0.013*"fear" + '
  '0.013*"reach" + 0.013*"exaggerated"'),
 (1,
  '0.032*"must" + 0.027*"combine" + 0.024*"engage" + 0.020*"legend_ten" + '
  '0.018*"ring" + 0.016*"familiar" + 0.016*"strange" + 0.016*"lack" + '
  '0.014*"simple" + 0.013*"similar"'),
 (2,
  '0.022*"power" + 0.019*"creature" + 0.016*"father" + 0.015*"satisfy" + '
  '0.013*"evil" + 0.012*"effort" + 0.011*"sister" + 0.011*"friend" + '
  '0.011*"totally" + 0.010*"praise"'),
 (3,
  '0.049*"emotion" + 0.036*"display" + 0.016*"unnecessary" + 0.016*"handle" + '
  '0.015*"double" + 0.015*"personally" + 0.014*"visible" + 0.014*"account" + '
  '0.011*"song" + 0.010*"distinct"'),
 (4,
  '0.023*"slow" + 0.022*"ring" + 0.021*"genuinely" + 0.019*"mystical" + '
  '0.019*"beautifully" + 0.017*"slattery" + 0.016*"major" + 0.016*"human" + '
  '0.014*"furthermore" + 0.014*"ground"'),
 (5,
  '0.070*"movie" + 0.02

In [4]:
visualize_topics(model, id2word, corpus)



# **Question 2: Sentiment Analysis**


(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

(1) Report the features used for sentiment classification and explain why you selected them.

(2) Select two supervised learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance of the two algorithms you selected in terms of accuracy, precision, recall, and F1 score. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
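
The four metrics named in item (3) can be derived from the counts of true and false positives and negatives. The sketch below works them out by hand for a binary case. The function `binary_metrics` and the label lists are hypothetical illustrations, not the scikit-learn implementation.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Hypothetical labels: 1 = positive review, 0 = negative review
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 0, 1, 0, 1, 0]
toy_metrics = binary_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```

In this toy case the classifier finds only two of the four positive reviews, so recall (0.5) is lower than precision (2/3) and F1 falls between the two.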


In [5]:
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Extract features and targets from dataframe
X, y = df['review'], df['sentiment']

# Create pipelines for different models
nb_clf = Pipeline([('count_vectorizer', CountVectorizer()),
                   ('tfidf_transformer', TfidfTransformer()),
                   ('nb', MultinomialNB())])
sgdc_clf = Pipeline([('count_vectorizer', CountVectorizer()),
                     ('tfidf_transformer', TfidfTransformer()),
                     ('sgd', SGDClassifier())])

# Perform cross validation over the models
scoring = {
    'accuracy': metrics.make_scorer(metrics.accuracy_score),
    'precision': metrics.make_scorer(metrics.precision_score, average='micro'),
    'recall': metrics.make_scorer(metrics.recall_score, average='micro'),
    'f1': metrics.make_scorer(metrics.f1_score, average='micro')
}
nb_scores = cross_validate(nb_clf,
                           X,
                           y,
                           cv=5,
                           scoring=scoring,
                           return_train_score=True,
                           n_jobs=-1)
sgdc_scores = cross_validate(sgdc_clf,
                             X,
                             y,
                             cv=5,
                             scoring=scoring,
                             return_train_score=True,
                             n_jobs=-1)


def print_scores(model_name, scores):
    print(f'{model_name} scores:')
    for score, values in scores.items():
        print(f'\t{score}: {np.mean(values):.2f}')


print_scores('MultinomialNB', nb_scores)
print_scores('SGDClassifier', sgdc_scores)


MultinomialNB scores:
	fit_time: 0.15
	score_time: 0.04
	test_accuracy: 0.72
	train_accuracy: 0.72
	test_precision: 0.72
	train_precision: 0.72
	test_recall: 0.72
	train_recall: 0.72
	test_f1: 0.72
	train_f1: 0.72
SGDClassifier scores:
	fit_time: 0.16
	score_time: 0.04
	test_accuracy: 0.80
	train_accuracy: 1.00
	test_precision: 0.80
	train_precision: 1.00
	test_recall: 0.80
	train_recall: 1.00
	test_f1: 0.80
	train_f1: 1.00
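
Within each model, accuracy, precision, recall, and F1 come out identical. This is consistent with the `average='micro'` setting: for single-label classification, micro-averaging pools every prediction, so micro precision reduces to the share of correct predictions, which is accuracy. The sketch below illustrates this on hypothetical labels and contrasts it with macro-averaging (`average='macro'` in scikit-learn), which weights each class equally. The helper `micro_macro_precision` is an assumption made for illustration only.

```python
from typing import List, Tuple


def micro_macro_precision(y_true: List[str],
                          y_pred: List[str]) -> Tuple[float, float, float]:
    """Return (accuracy, micro precision, macro precision) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}  # correct predictions per predicted class
    fp = {c: 0 for c in labels}  # wrong predictions per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1

    total_tp, total_fp = sum(tp.values()), sum(fp.values())
    accuracy = total_tp / len(y_true)
    # Micro: pool the counts of all classes before dividing
    micro = total_tp / (total_tp + total_fp)
    # Macro: compute precision per class, then take the unweighted mean
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in labels]
    macro = sum(per_class) / len(labels)
    return accuracy, micro, macro


# Hypothetical, imbalanced three-class example
y_true_demo = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neu', 'neu']
y_pred_demo = ['pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neu']
acc, micro_p, macro_p = micro_macro_precision(y_true_demo, y_pred_demo)
print(acc, micro_p, macro_p)
```

If the goal is to compare the two classifiers on more than accuracy, macro-averaged or per-class scores are likely to be more informative than the micro-averaged ones.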


# **Question 3: House price prediction**


(40 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded here: https://github.com/unt-iialab/info5731-spring2022/blob/main/assignments/assignment4-question3-data.zip. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

# Load training and test data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# Preprocess the data
Xy_train = df_train.dropna(axis=1)
X_train, y_train = Xy_train.drop(['SalePrice'], axis=1), Xy_train['SalePrice']
X_test = df_test[X_train.columns]

# Fill null values in the test dataset 
fill_mode = lambda col: col.fillna(col.mode()[0])
X_test = X_test.apply(fill_mode, axis=0)

# Select categorical columns for encoding 
categorical_cols = X_train.select_dtypes(exclude=['number']).columns

# Prepare pipeline
reg_pipeline = Pipeline([
    ('transformer',
     ColumnTransformer([
         ('encoder', OrdinalEncoder(), categorical_cols)
     ],
                       remainder='passthrough')),
    ('scaler', MinMaxScaler()),
    ('reg', RandomForestRegressor()),
])

# Train the regression model
reg_pipeline.fit(X_train, y_train)
y_train_pred = reg_pipeline.predict(X_train)

train_score = metrics.r2_score(y_train, y_train_pred)
print(f'Train R2 Score: {train_score}')

Train R2 Score: 0.9825035547268258
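
The score above is computed on the same rows the model was fitted to, so it probably overstates how well the model generalizes. Scoring a held-out part of `train.csv` would give a fairer estimate. For reference, the sketch below spells out the coefficient of determination, R² = 1 − SS_res / SS_tot, in plain Python. The function `r2` and the toy prices are hypothetical and only illustrate the definition.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices (in thousands)
prices = [100.0, 150.0, 200.0, 250.0]
r2_perfect = r2(prices, prices)                        # exact predictions
r2_mean = r2(prices, [175.0] * 4)                      # always predict the mean
r2_close = r2(prices, [110.0, 140.0, 210.0, 240.0])    # each off by 10
print(r2_perfect, r2_mean, r2_close)
```

A model that always predicts the mean price scores 0, which makes R² a measure of improvement over that baseline rather than an absolute error.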


In [7]:
# Predict and save predictions for the test data
y_test_pred = reg_pipeline.predict(X_test)
y_test_pred = pd.Series(y_test_pred)
y_test_pred.to_csv('test_predictions.csv')