<a href="https://colab.research.google.com/github/ManJ-PC/Psychosis-AI/blob/master/Text_classification2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download the original dataset and clean it

* **training.1600000.processed.noemoticon.csv:** raw data from Sentiment140 - 1.6 million tweets tagged for sentiment, no column headers, nothing cleaned up

In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip -P data
!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip

File ‘data/training.1600000.processed.noemoticon.csv.zip’ already there; not retrieving.

Archive:  data/training.1600000.processed.noemoticon.csv.zip


## Import CSV to dataframe

**Note:** the dataset contains polarity, id, date, query, user and text columns, but the CSV file has no header row, so we have to name each column ourselves.

In [None]:
import pandas as pd

df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                names=['polarity', 'id', 'date', 'query', 'user', 'text'],
                encoding='latin-1')
df.head()

Unnamed: 0,polarity,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


## Update polarity
Right now the polarity column is 0 for negative and 4 for positive. Let's change that to 0 and 1 to make the labels easier to read.

How many positive and negative tweets are there?


In [None]:
df.polarity.value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

Let's map the positive label 4 to 1 and keep 0 for negative.


In [None]:
df.polarity = df.polarity.replace({0: 0, 4: 1})
df.polarity.value_counts()

1    800000
0    800000
Name: polarity, dtype: int64

## Remove columns we do not need

We only need polarity and text, so we drop id, date, query and user.

In [None]:
df = df.drop(columns=['id', 'date', 'query', 'user'])
df.head()

Unnamed: 0,polarity,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


## Sampling

Let's reduce the data to a random sample of 500,000 tweets so it is easier to work with.



In [None]:
df = df.sample(n=500000)
df.polarity.value_counts()

1    250439
0    249561
Name: polarity, dtype: int64
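
The sample above is drawn without a fixed seed, so the class counts change slightly on every run. As a minimal sketch (the toy DataFrame and its contents are hypothetical stand-ins, not the real tweets), passing `random_state` makes the draw repeatable, and sampling inside each polarity group keeps the two classes exactly balanced:

```python
import pandas as pd

# Toy stand-in for the tweet DataFrame (hypothetical data, same column names).
toy = pd.DataFrame({
    "polarity": [0, 0, 0, 0, 1, 1, 1, 1],
    "text": ["bad", "sad", "awful", "meh", "good", "great", "nice", "yay"],
})

# random_state fixes the random draw, so repeated runs return the same rows.
sample_a = toy.sample(n=4, random_state=42)
sample_b = toy.sample(n=4, random_state=42)
same_rows = sample_a.index.equals(sample_b.index)

# Sampling within each polarity group keeps the classes exactly balanced.
balanced = toy.groupby("polarity").sample(n=2, random_state=42)
balanced_counts = balanced.polarity.value_counts().to_dict()

print(same_rows, balanced_counts)
```

On the full dataset the same idea would be `df.sample(n=500000, random_state=42)`; the seed value 42 is an arbitrary choice.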

# Download cleaned data



Before we get started, we need to download all of the data we'll be using.

* **sentiment140-subset.csv:** cleaned subset of Sentiment140 data - half a million tweets marked as positive or negative

In [None]:
# Make data directory if it doesn't exist
!mkdir -p data

# Download sentiment140-subset
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip -P data
!unzip -n -d data data/sentiment140-subset.csv.zip

--2021-06-09 14:58:43--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17927149 (17M) [application/zip]
Saving to: ‘data/sentiment140-subset.csv.zip’


2021-06-09 14:58:44 (17.1 MB/s) - ‘data/sentiment140-subset.csv.zip’ saved [17927149/17927149]

Archive:  data/sentiment140-subset.csv.zip
  inflating: data/sentiment140-subset.csv  


Import the first 20,000 rows

In [None]:
import pandas as pd

df = pd.read_csv("data/sentiment140-subset.csv", nrows=20000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last ti...
4,0,@MrKinetik Not a thing!!! I don't really have...


## Analyse data

How many rows?

In [None]:
df.shape

(20000, 2)

How many positive and negative tweets?

In [None]:
df.polarity.value_counts()

1    10012
0     9988
Name: polarity, dtype: int64

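# Document representation

Transform the documents into feature vectors. We first look at raw term counts and then at the TF-IDF representation, which is the one the classifiers below are trained on.

Because of computational constraints, we keep only the 1,000 most frequent words (max_features=1000).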
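
To make the TF-IDF weighting concrete, here is a small hand computation in plain Python. It is a sketch under stated assumptions: the three documents are made up, tokenization is a plain whitespace split (simpler than scikit-learn's default tokenizer), and the formula follows scikit-learn's documented defaults, that is raw counts times the smoothed idf `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import math

# Made-up mini corpus; whitespace tokenization is a simplifying assumption.
docs = [
    "good morning world",
    "good night",
    "bad morning",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocab}

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    raw = [tokens.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first document, "world" appears in only one document and therefore gets a larger weight than "good", which appears in two. That is the effect TF-IDF adds on top of plain counts.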

In [None]:
# !pip install scikit-learn

### Term frequency

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

[CountVectorizer parameters](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
vectorizer = CountVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,airport,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,...,words,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,your,yours,yourself,youtube,yup
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

[TfidfVectorizer parameters](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,airport,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,...,words,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,your,yours,yourself,youtube,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333412,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.426042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Features and Labels

We now need to separate the data into features, which represent the documents, and labels, which say whether each document is positive or negative.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
features = words_df
labels = df.polarity

### Split data into training set and test set.

In [None]:
# Hold out 33% of the rows for testing; random_state makes the split repeatable
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33, random_state=42)
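
Note that the vectorizer above is fitted on all 20,000 tweets before the split, so the vocabulary and idf weights have already seen the test tweets. A possible alternative, sketched here on a hypothetical toy corpus rather than the real data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, which fits the vectorizer on the training text only. The variable names below are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df.text and df.polarity.
texts = ["love this", "so happy today", "great day", "what a nice song",
         "hate this", "so sad today", "awful day", "what a bad song"]
polarity = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test tweets stay unseen by the vectorizer.
text_train, text_test, y_train, y_test = train_test_split(
    texts, polarity, test_size=0.25, random_state=42, stratify=polarity)

# fit() learns the vocabulary and idf weights from the training text only,
# then trains the classifier on the resulting sparse matrix.
model = make_pipeline(TfidfVectorizer(max_features=1000),
                      LogisticRegression(max_iter=1000))
model.fit(text_train, y_train)

# predict() reuses the training vocabulary to transform the test text.
predictions = model.predict(text_test)
print(len(text_train), len(text_test), list(predictions))
```

A side effect of this design is that the features stay in sparse form, so the dense `toarray()` conversion is not needed for training.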

# Algorithms

Let's try to classify tweets.

## Train the classification model


In [None]:
#from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

### Create and train a logistic regression classifier

In [None]:
%%time
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(features_train, labels_train)

CPU times: user 15.1 s, sys: 1.22 s, total: 16.3 s
Wall time: 8.4 s


### Create and train a random forest classifier

In [None]:
%%time
forest = RandomForestClassifier(n_estimators=50)
forest.fit(features_train, labels_train)

CPU times: user 7.07 s, sys: 93.7 ms, total: 7.17 s
Wall time: 7.06 s


### Create and train a linear support vector classifier (LinearSVC)


In [None]:
%%time
svc = LinearSVC()
svc.fit(features_train, labels_train)

CPU times: user 128 ms, sys: 0 ns, total: 128 ms
Wall time: 131 ms


### Create and train a multinomial naive bayes classifier (MultinomialNB)




In [None]:
%%time
bayes = MultinomialNB()
bayes.fit(features_train, labels_train)

CPU times: user 42.8 ms, sys: 4.98 ms, total: 47.8 ms
Wall time: 54.9 ms


# Test the learning models

Let's evaluate the trained models on the test set.

## Calculate predictions

### Test the logistic regression classifier

In [None]:
pred_logreg = logreg.predict(features_test)

### Test the random forest classifier



In [None]:
pred_forest = forest.predict(features_test)

### Test the linear support vector classifier (LinearSVC)

In [None]:
pred_svc = svc.predict(features_test)

### Test the multinomial naive bayes classifier (MultinomialNB)

In [None]:
pred_bayes = bayes.predict(features_test)

## Calculate performance results

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

Further reading: https://towardsdatascience.com/understanding-the-confusion-matrix-from-scikit-learn-c51d88929c79

### Results of the logistic regression classifier



In [None]:
print(confusion_matrix(labels_test, pred_logreg))
print(classification_report(labels_test, pred_logreg))

[[2420  884]
 [ 832 2464]]
              precision    recall  f1-score   support

           0       0.74      0.73      0.74      3304
           1       0.74      0.75      0.74      3296

    accuracy                           0.74      6600
   macro avg       0.74      0.74      0.74      6600
weighted avg       0.74      0.74      0.74      6600
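
The figures in the report above can be recomputed by hand from the confusion matrix, which helps when reading the two outputs together. The sketch below uses only the four numbers printed for logistic regression; the helper name `per_class_metrics` is made up for this illustration. In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels.

```python
# Confusion matrix printed above for logistic regression.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
matrix = [[2420, 884],
          [832, 2464]]


def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class of a 2x2 confusion matrix."""
    true_pos = matrix[label][label]
    predicted = sum(row[label] for row in matrix)  # column total
    actual = sum(matrix[label])                    # row total (the support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Accuracy is the share of test tweets on the diagonal.
accuracy = (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))

for label in (0, 1):
    p, r, f = per_class_metrics(matrix, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

Rounded to two decimals, this reproduces the precision, recall, f1-score and accuracy rows of the report.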



### Results of the random forest classifier

In [None]:
print(confusion_matrix(labels_test, pred_forest))
print(classification_report(labels_test, pred_forest))

[[2391  913]
 [ 993 2303]]
              precision    recall  f1-score   support

           0       0.71      0.72      0.72      3304
           1       0.72      0.70      0.71      3296

    accuracy                           0.71      6600
   macro avg       0.71      0.71      0.71      6600
weighted avg       0.71      0.71      0.71      6600



### Results of the linear support vector classifier (LinearSVC)

In [None]:
print(confusion_matrix(labels_test, pred_svc))
print(classification_report(labels_test, pred_svc))

[[2433  871]
 [ 820 2476]]
              precision    recall  f1-score   support

           0       0.75      0.74      0.74      3304
           1       0.74      0.75      0.75      3296

    accuracy                           0.74      6600
   macro avg       0.74      0.74      0.74      6600
weighted avg       0.74      0.74      0.74      6600



### Results of the multinomial naive bayes classifier (MultinomialNB)

In [None]:
print(confusion_matrix(labels_test, pred_bayes))
print(classification_report(labels_test, pred_bayes))

[[2500  804]
 [ 873 2423]]
              precision    recall  f1-score   support

           0       0.74      0.76      0.75      3304
           1       0.75      0.74      0.74      3296

    accuracy                           0.75      6600
   macro avg       0.75      0.75      0.75      6600
weighted avg       0.75      0.75      0.75      6600
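
To compare the four models side by side, the accuracies can be computed to more decimals from the confusion matrices printed above. This is only a summary of the numbers already shown, not a new experiment.

```python
# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
matrices = {
    "LogisticRegression": [[2420, 884], [832, 2464]],
    "RandomForestClassifier": [[2391, 913], [993, 2303]],
    "LinearSVC": [[2433, 871], [820, 2476]],
    "MultinomialNB": [[2500, 804], [873, 2423]],
}


def accuracy(matrix):
    """Share of correctly classified tweets: diagonal over total."""
    return (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))


scores = {name: accuracy(m) for name, m in matrices.items()}

# Sort model names from highest to lowest accuracy.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name:24s} {scores[name]:.4f}")
```

On this single split, MultinomialNB, LinearSVC and logistic regression land within about one percentage point of each other, while the random forest trails by roughly three points. With 6,600 test tweets and one split, the gap between the top three is too small to name a clear winner.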



Colab sources:

https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/sentiment-analysis-is-bad/notebooks/Cleaning%20the%20Sentiment140%20data.ipynb#scrollTo=pSMZSKlUpdmd

https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/investigating-sentiment-analysis/notebooks/Designing%20your%20own%20sentiment%20analysis%20tool.ipynb#scrollTo=x_fZWtUunLZV

https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
