# Topic Modelling and Sentiment Analysis of Text Message Data

## Explore the Dataset

In [15]:
# Import the dataset and preview the first 10 rows
import pandas as pd
from IPython.display import display, HTML

originalData = pd.read_csv('./clean_nus_sms.csv')
display(HTML(originalData.head(10).to_html()))

| Unnamed: 0.1 | Unnamed: 0 | id | Message | length | country | Date |
|---|---|---|---|---|---|---|
| 0 | 0 | 10120 | Bugis oso near wat... | 21 | SG | 2003/4 |
| 1 | 1 | 10121 | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... | 111 | SG | 2003/4 |
| 2 | 2 | 10122 | I dunno until when... Lets go learn pilates... | 46 | SG | 2003/4 |
| 3 | 3 | 10123 | Den only weekdays got special price... Haiz... Cant eat liao... Cut nails oso muz wait until i finish drivin wat, lunch still muz eat wat... | 140 | SG | 2003/4 |
| 4 | 4 | 10124 | Meet after lunch la... | 22 | SG | 2003/4 |
| 5 | 5 | 10125 | m walking in citylink now ü faster come down... Me very hungry... | 65 | SG | 2003/4 |
| 6 | 6 | 10126 | 5 nights...We nt staying at port step liao...Too ex | 51 | SG | 2003/4 |
| 7 | 7 | 10127 | Hey pple...$700 or $900 for 5 nights...Excellent location wif breakfast hamper!!! | 81 | SG | 2003/4 |
| 8 | 8 | 10128 | Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,61,66,382. Ubi cres,ubi tech park.6ph for 1st 5wkg days.èn | 160 | SG | 2003/4 |
| 9 | 9 | 10129 | Hey tmr maybe can meet you at yck | 33 | SG | 2003/4 |


## Goal

The dataset does not come pre-labelled with sentiment, so the messages will probably need to be labelled with a tool such as TextBlob. The ultimate goal of this project is to conduct sentiment analysis on the text messages and to find the most common topics users text about. It may also be interesting to compare the results by date, to see how trends change over time, or by country.
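
A minimal sketch of how that labelling step might look. TextBlob reports a polarity score between -1 and 1 through `TextBlob(text).sentiment.polarity`; bucketing that score into three labels is one option. The helper names `polarity_to_label` and `label_message`, and the 0.05 threshold, are assumptions made for illustration and would need tuning against the actual messages.

```python
def polarity_to_label(polarity, threshold=0.05):
    """Bucket a polarity score in [-1, 1] into a coarse sentiment label."""
    # The threshold is an illustrative assumption, not a TextBlob default.
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def label_message(text):
    """Label one message using TextBlob's default sentiment analyzer."""
    # Imported lazily so the bucketing helper works without TextBlob installed.
    from textblob import TextBlob
    return polarity_to_label(TextBlob(text).sentiment.polarity)


print(polarity_to_label(0.6))
print(polarity_to_label(-0.3))
print(polarity_to_label(0.0))
```

Applied to the DataFrame, this could look like `originalData['Message'].astype(str).apply(label_message)`. TextBlob's default analyzer is built around standard English, so scores for colloquial messages may be less reliable and are worth spot-checking by hand.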

## Preprocessing

My plans for preprocessing are:
- Spelling correction: Use the TextBlob Python library.
- Noise removal: Use regex to remove punctuation, accents, special characters, numeric digits, extra whitespace (leading, trailing, and vertical), and HTML formatting.
- Tokenization: Break the text messages into smaller components (text -> sentence level -> word level).
- Normalization: Convert all text to lowercase, remove stopwords, and apply stemming and lemmatization (guided by POS tagging).

- Open question: Is there a way to expand contractions (e.g. don’t → do not), and to convert emojis into the meaning they carry in the document?
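
On the contractions question, one possible approach is a lookup table applied with a single regex. The sketch below uses only the standard library; the `CONTRACTIONS` mapping and the `expand_contractions` name are hypothetical, and the mapping is deliberately tiny. Some entries are ambiguous in practice (for example "it's" can also mean "it has"), so treat this as a starting point only.

```python
import re

# Small illustrative mapping (an assumption); a real project needs a longer list.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "let's": "let us",
}

# One alternation over all known contractions, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded, lowercase forms."""
    # Normalise the typographic apostrophe (U+2019) to a plain one first.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("Don\u2019t worry, I'm sure it's fine"))
```

The replacements come out lowercase, which is harmless if the text is lowercased during normalization anyway. Forms written without an apostrophe, such as "Cant" in the sample rows, are not matched by this table. For emojis, the third-party `emoji` package provides a `demojize` function that replaces each emoji with a text name, which may be a reasonable first step.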

In [14]:
import re
text = "  Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,61,66,382. Ubi cres,ubi tech park.6ph for 1st 5wkg days.èn	 "
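# \W+ collapses each run of non-word characters into a single space. Digits and
# accented letters (ü, è) count as word characters in Python 3, so they survive,
# and one leading and one trailing space remain because nothing is stripped.
cleaned = re.sub(r'\W+', ' ', text).lower()
print(cleaned)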

 yun ah the ubi one say if ü wan call by tomorrow call 67441233 look for irene ere only got bus8 22 65 61 66 382 ubi cres ubi tech park 6ph for 1st 5wkg days èn 
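
The single `\W+` substitution above keeps digits, accented letters, and the surrounding spaces, so it covers only part of the noise-removal plan. A fuller pass might look like the sketch below, which uses only the standard library. The function names and the tiny `STOPWORDS` set are assumptions for illustration; sentence splitting, stemming, and lemmatization are left out and would typically come from a library such as NLTK.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); NLP libraries ship fuller ones.
STOPWORDS = {"the", "if", "by", "for", "only", "one"}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks (ü -> u)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_message(text):
    """Remove HTML tags, accents, digits, punctuation, and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # crude HTML tag removal
    text = strip_accents(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace


def tokenize(text, stopwords=STOPWORDS):
    """Split a cleaned message into word tokens, skipping stopwords."""
    return [tok for tok in clean_message(text).split() if tok not in stopwords]


sample = "  Ubi cres,ubi tech park.6ph for 1st 5wkg days.\u00e8n\t "
print(clean_message(sample))
print(tokenize(sample))
```

Removing digits outright turns tokens such as "1st" and "6ph" into "st" and "ph", so it may be worth deciding whether numeric information matters for the topic model before applying this step.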
