
ML_coursework

This is a repository for Machine Learning coursework from Cardiff University.

Getting started

This repository is part of the Machine Learning (ML) coursework for the MSc in Data Science and Analytics at Cardiff University. The objective of this exercise is to provide a Python machine learning model able to perform sentiment analysis on a movie review dataset (IMDb).

Dataset

The dataset used in this exercise is available in this repository. For those who want the original source, it is based on the IMDb Reviews dataset. Go to the site, look for “Large Movie Review Dataset v1.0” and download it. You will then find a file called aclImdb_v1.tar.gz in your downloads folder. Unpack it and save it in the same folder where you run your Python file. Once unpacked, the dataset is provided as text files with positive and negative reviews. In the coursework dataset files, the core dataset contains 25,000 reviews split into train, development and test sets. The overall distribution of labels is roughly balanced.
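If you prefer to fetch and unpack the archive from a script, a minimal sketch in Python is shown below. The URL is the Stanford page hosting the dataset at the time of writing (an assumption; adjust it if the dataset moves).

import tarfile
import urllib.request

# Assumed dataset location (the Stanford host for Maas et al., 2011).
URL = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

urllib.request.urlretrieve(URL, 'aclImdb_v1.tar.gz')
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()  # creates an aclImdb/ folder next to the script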

Files

There are two files:

  • Data set (it has 3 folders - train, develop and test; each folder contains files with negative and positive reviews, one review per line)
  • One Python script

This exercise was built to run in Python from the terminal. To follow the instructions more easily, I suggest downloading the data files available here and saving them in the same folder as the Python file.

Reference paper


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Installing packages

This code uses the following packages:


import numpy as np
import pandas as pd
import re
import nltk
import sklearn

#Data Preprocessing and Feature Engineering
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, ENGLISH_STOP_WORDS
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn import svm
from nltk import sent_tokenize, word_tokenize
from sklearn.svm import LinearSVC, SVC
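The packages themselves can be installed with pip (numpy, pandas, nltk, scikit-learn). Note that the lemmatizer and tokenizers used below also depend on NLTK corpora that are not shipped with the package itself; a one-time download step, run once before the rest of the code, looks like this (the names are the standard NLTK resource identifiers):

import nltk

# One-time downloads for the NLTK resources used in this exercise.
nltk.download('punkt')      # tokenizer models for word_tokenize / sent_tokenize
nltk.download('wordnet')    # lexical database behind WordNetLemmatizer
nltk.download('omw-1.4')    # extra wordnet data required by newer NLTK releases
nltk.download('stopwords')  # NLTK stop word lists (imported above)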

Importing the data files

The original datasets are split into positive and negative text files, so it is necessary to merge the contents of both. I used the following commands to do that. First, we are going to create the train dataset using pandas.

Load the positive and negative .txt files using the pandas library, with the newline character (‘\n’) as the separator. Here ‘imdb_train_pos.txt’ is the name of the positive review file and ‘imdb_train_neg.txt’ the negative one.

df_pos = pd.read_csv('imdb_train_pos.txt', sep='\n', header=None)
df_neg = pd.read_csv('imdb_train_neg.txt', sep='\n', header=None)
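Note that sep='\n' is deprecated in recent pandas releases and removed in pandas 2.0. If the lines above fail on your pandas version, a minimal alternative that reads one review per line into the same single-column shape looks like this:

def read_reviews(path):
    # One review per line, in a single column named 0
    # (the same layout pd.read_csv(..., header=None) produces).
    with open(path, encoding='utf-8') as f:
        return pd.DataFrame({0: [line.rstrip('\n') for line in f if line.strip()]})

df_pos = read_reviews('imdb_train_pos.txt')
df_neg = read_reviews('imdb_train_neg.txt')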

Then, create a second column [1] with the review labels: 1 for positive reviews and 0 for negative ones.

df_pos[1] = 1
df_neg[1] = 0
#print(df_pos.head())
#print(df_neg.head())

Now we are going to concatenate both dataframes.

df_train = pd.concat([df_pos,df_neg])

Then, we can also rename the columns:

df_train.columns = ['text','label']

Now repeat this process to create the test dataset and the development dataset.

df_pos_test = pd.read_csv('imdb_test_pos.txt', sep='\n', header=None)
df_neg_test = pd.read_csv('imdb_test_neg.txt', sep='\n', header=None)

df_pos_test[1] = 1
df_neg_test[1] = 0

# Concatenate the positive and negative test frames.
df_test = pd.concat([df_pos_test, df_neg_test])
df_test.columns = ['text', 'label']
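The development set follows the same pattern; a short sketch is below. The filenames imdb_dev_pos.txt and imdb_dev_neg.txt are assumptions, so adjust them to match the files in this repository.

df_pos_dev = pd.read_csv('imdb_dev_pos.txt', sep='\n', header=None)  # assumed filename
df_neg_dev = pd.read_csv('imdb_dev_neg.txt', sep='\n', header=None)  # assumed filename

df_pos_dev[1] = 1
df_neg_dev[1] = 0

df_dev = pd.concat([df_pos_dev, df_neg_dev])
df_dev.columns = ['text', 'label']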

Feature extraction

Feature extraction/engineering covers techniques for treating, extracting and reducing the features that are fed into a machine learning model. It can shorten training time and increase model accuracy, because it reduces dimensionality without losing important information. For this reason, feature extraction is essential to an effective machine learning model. This exercise applied the following preprocessing and feature extraction methods, implemented in the code after the list:

  • Vectorization
  • Bag-of-words with 3-gram range
  • Stop words
  • Lemmatization

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Lemmatization: tokenize each document and lemmatize every token.
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

# Stop words: extend scikit-learn's built-in English list with extra tokens.
# Note that word_tokenize splits '<br />' into separate symbols, so 'br' is
# the token that actually reaches the stop word filter.
my_stopwords = ENGLISH_STOP_WORDS.union(['@', 'br', '<br />'])

# Vectorization: bag of words with unigrams to trigrams, capped at 1,000 features.
vect = CountVectorizer(max_features=1000, ngram_range=(1, 3), stop_words=my_stopwords, tokenizer=LemmaTokenizer())
X = vect.fit_transform(df_train.text)
# Use transform (not fit_transform) on the test set so it reuses the vocabulary
# fitted on the train set; refitting would produce different columns.
X_test = vect.transform(df_test.text)

# Transform to an array
my_array = X.toarray()
my_array_test = X_test.toarray()

# Transform back to a dataframe, assigning the n-grams as column names.
# On scikit-learn versions before 1.0, use vect.get_feature_names() instead.
feature_names = vect.get_feature_names_out()
X_df = pd.DataFrame(my_array, columns=feature_names)
X_df_test = pd.DataFrame(my_array_test, columns=feature_names)
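An optional sanity check: the train and test matrices must share the same columns, because the model assumes an identical feature order at prediction time.

print(X_df.shape, X_df_test.shape)  # same number of columns expected
print(X_df.columns[:10].tolist())   # a peek at the first few n-gram features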

Model - SVM

After preprocessing and extracting features, we can now train our machine learning model on the training set. I am using the SVM from the sklearn package here. If necessary, we can use the development set to try out new feature extraction choices before running the model on the test set.

# Linear-kernel SVM (the gamma parameter is only used by non-linear kernels).
svm_review = sklearn.svm.SVC(kernel="linear", gamma='auto')
model = svm_review.fit(X_df, df_train.label)
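As a side note, the imports above include LinearSVC, which fits the same kind of linear decision boundary but trains much faster on thousands of documents; swapping it in is a two-line change (a sketch using default hyperparameters):

svm_review = LinearSVC()  # usually scores close to SVC(kernel="linear")
model = svm_review.fit(X_df, df_train.label)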

Then, once the model is trained and we are comfortable with its initial results, we can run it on the test set and get some predictions.

predictions = model.predict(X_df_test)

Print the confusion matrix, precision, recall, F-measure and accuracy of the model.

print(confusion_matrix(df_test.label,predictions))
print(classification_report(df_test.label,predictions))
print(accuracy_score(df_test.label, predictions))

Thank you, I hope you enjoyed this exercise.
