# Applied tweet classification with Python

In [14]:
import pandas as pd 
import numpy as np
# Read the training data
df=pd.read_csv('C:/Users/TOCHIBA/desktop/train.csv')
df.head(10)

Unnamed: 0,TweetId,Label,TweetText
0,304271250237304833,Politics,'#SecKerry: The value of the @StateDept and @U...
1,304834304222064640,Politics,'@rraina1481 I fear so'
2,303568995880144898,Sports,'Watch video highlights of the #wwc13 final be...
3,304366580664528896,Sports,'RT @chelscanlan: At Nitro Circus at #AlbertPa...
4,296770931098009601,Sports,'@cricketfox Always a good thing. Thanks for t...
5,306713195832307712,Politics,'Dr. Rajan: Fiscal consolidation will create m...
6,306100962337112064,Politics,"FACT: More than 800,000 defense employees will..."
7,305951758759366657,Sports,"'1st Test. Over 39: 0 runs, 1 wkt (M Wade 0, M..."
8,304482567158104065,Sports,Some of Africa's top teams will try and take a...
9,303806584964935680,Sports,'Can you beat the tweet of @RoryGribbell and z...


In [15]:
print(df.shape)

(6525, 3)


## Checking for missing data in the Label column

In [16]:
missing_data=df['Label'].isnull()
missing_data.head()

0    False
1    False
2    False
3    False
4    False
Name: Label, dtype: bool

In [17]:
print(missing_data.value_counts())

False    6525
Name: Label, dtype: int64


So now we are sure that there are no missing values in our target column.

For simplicity, I am going to classify each tweet as Politics or not Politics.

In [18]:
df['target']= np.where(df['Label']=='Politics',1,0)
df.head(10)

Unnamed: 0,TweetId,Label,TweetText,target
0,304271250237304833,Politics,'#SecKerry: The value of the @StateDept and @U...,1
1,304834304222064640,Politics,'@rraina1481 I fear so',1
2,303568995880144898,Sports,'Watch video highlights of the #wwc13 final be...,0
3,304366580664528896,Sports,'RT @chelscanlan: At Nitro Circus at #AlbertPa...,0
4,296770931098009601,Sports,'@cricketfox Always a good thing. Thanks for t...,0
5,306713195832307712,Politics,'Dr. Rajan: Fiscal consolidation will create m...,1
6,306100962337112064,Politics,"FACT: More than 800,000 defense employees will...",1
7,305951758759366657,Sports,"'1st Test. Over 39: 0 runs, 1 wkt (M Wade 0, M...",0
8,304482567158104065,Sports,Some of Africa's top teams will try and take a...,0
9,303806584964935680,Sports,'Can you beat the tweet of @RoryGribbell and z...,0


In [19]:
df['target'].mean()

0.4904214559386973

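The mean of `target` is about 0.49, so roughly 49% of the tweets are labelled Politics and the remaining 51% are not: the two classes are close to balanced. In the rows displayed above, the non-Politics tweets are all labelled Sports.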
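
Because `target` is a 0/1 column, its mean is the fraction of Politics tweets. As a quick check, the class counts can be recovered from that mean and the row count reported earlier. This is only an arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the outputs above
n_rows = 6525
target_mean = 0.4904214559386973

# The mean of a 0/1 column is (number of ones) / (number of rows)
n_politics = round(target_mean * n_rows)
n_other = n_rows - n_politics

print(n_politics, n_other)
print(f"Politics share: {n_politics / n_rows:.1%}")
```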

## Model development

In [20]:
from sklearn.model_selection import train_test_split
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['TweetText'], 
                                                    df['target'], 
                                                    random_state=0)

In [21]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 'President Obama Calls for Humility at the National Prayer Breakfast http://t.co/3jlQQmPx'


X_train shape:  (4893,)
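
The training set has 4893 rows because `train_test_split` was called without a `test_size`, so scikit-learn falls back to its default of holding out 25% of the data. A minimal sketch of that arithmetic, assuming the default behaviour (test size rounded up, train size rounded down):

```python
import math

n_samples = 6525   # rows in df, from df.shape above
test_size = 0.25   # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_test)
```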


Looking at X_train, we can see we have a series of 4893 tweets, or documents. We'll need to convert these into a numeric representation that scikit-learn can use. The bag-of-words approach is a simple and commonly used way to represent text for machine learning: it ignores structure and only counts how often each word occurs. CountVectorizer lets us use the bag-of-words approach by converting a collection of text documents into a matrix of token counts.
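
To make the idea concrete, here is a plain-Python sketch of a bag-of-words matrix. It is not CountVectorizer's implementation; it only mimics its default behaviour of lowercasing and keeping tokens of two or more word characters. The three documents are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical toy documents, not taken from the dataset
docs = [
    "Obama meets the prime minister",
    "The cricket test starts today",
    "Prime minister watches the cricket",
]

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost: each row only records how many times each vocabulary term appears in that document.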

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [24]:
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vect.get_feature_names()[::200]

['00',
 '177',
 '24mo',
 '477',
 '6urb8kla',
 'aarava_on_ps3',
 'afaqsworld',
 'amenable',
 'arians',
 'avenue',
 'bbge09su',
 'bjp',
 'brewing',
 'calangute',
 'cfsm8p9kcu',
 'class',
 'composed',
 'county',
 'd18mtttv',
 'delivered',
 'displaying',
 'dubai',
 'eliminate',
 'est',
 'fabwmyt0',
 'financl',
 'fourth',
 'gen',
 'grandson',
 'happe',
 'hlvoeg',
 'hydroelectric',
 'induced',
 'islamist',
 'johnsuncricket',
 'khandm',
 'lamb',
 'lined',
 'madrid',
 'mcchrystal',
 'min',
 'muller',
 'next',
 'nytimes',
 'ophz1f97kt',
 'part',
 'pilots',
 'prague',
 'proud2bs',
 'qzlj4fnt',
 'recorded',
 'respect',
 'ronthedon08',
 'santa',
 'sendblueroohome',
 'sides',
 'sometime',
 'statedept',
 'summer',
 'tayyip',
 'three',
 'transfer',
 'u2018heroic',
 'until',
 'vhfu1nz3re',
 'warmongers',
 'winter',
 'xn1vrumsvs',
 'zgc4fy6i']

Looking at every 200th feature gives a rough sense of what the vocabulary looks like. It is pretty messy: it includes tokens containing digits, user handles, and misspellings.

Next, we'll use the transform method to convert the documents in X_train to a document-term matrix, giving us the bag-of-words representation of X_train. This representation is stored in a SciPy sparse matrix, where each row corresponds to a document and each column to a word from our training vocabulary.


In [25]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
X_train_vectorized

<4893x13656 sparse matrix of type '<class 'numpy.int64'>'
	with 79440 stored elements in Compressed Sparse Row format>
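
The numbers in this output show why a sparse format is used. A small arithmetic sketch, using only the values printed above:

```python
n_docs, n_terms = 4893, 13656   # matrix shape from the output above
stored = 79440                  # stored (non-zero) entries reported by SciPy

density = stored / (n_docs * n_terms)
avg_terms_per_tweet = stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct vocabulary terms per tweet: {avg_terms_per_tweet:.1f}")
```

Only a small fraction of the cells are non-zero, so storing just those entries is far cheaper than a dense array of the same shape.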

In [26]:
len(vect.get_feature_names())

13656

Now we'll build our model, a logistic regression classifier. Logistic regression is a good fit for binary classification: it outputs a probability for each predicted label, and it works well on high-dimensional sparse data such as our document-term matrix.
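
As a rough illustration of where that probability comes from, logistic regression computes a weighted sum of the term counts and passes it through the sigmoid function. The weights, bias, and three-term vocabulary below are hypothetical; they are not the values learned in this notebook.

```python
import math

def politics_probability(weights, bias, counts):
    # Linear score followed by the logistic (sigmoid) function
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three terms: "president", "cricket", "the"
weights = [1.8, -2.1, 0.0]
bias = -0.1

p_politics = politics_probability(weights, bias, [1, 0, 1])  # "the president"
p_sports = politics_probability(weights, bias, [0, 1, 1])    # "the cricket"

print(round(p_politics, 3), round(p_sports, 3))
```

A positive weight pushes the probability of the positive class (Politics) above 0.5, a negative weight pushes it below, and a zero weight has no effect.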

In [27]:
from sklearn.linear_model import LogisticRegression
# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [28]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9437280028674396
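
One caveat about this number: AUC measures how well continuous scores rank positives above negatives. Passing the hard 0/1 labels from `model.predict`, as the cell above does, collapses every score to one of two values, so the result generally differs from the AUC computed on predicted probabilities. The sketch below shows the effect on made-up data; the `auc` helper is a plain pairwise implementation written for illustration, not scikit-learn's code.

```python
def auc(y_true, scores):
    # Probability that a random positive is scored above a random negative,
    # counting ties as half a win
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.6, 0.45, 0.55, 0.3, 0.1]

# Hard labels obtained with a 0.5 threshold
hard = [1 if p >= 0.5 else 0 for p in proba]

print("AUC from probabilities:", auc(y_true, proba))
print("AUC from hard labels:  ", auc(y_true, hard))
```

In the notebook, the corresponding change would be to pass `model.predict_proba(vect.transform(X_test))[:, 1]` to `roc_auc_score` instead of `predictions`. The resulting value is not shown here because it was not part of the original run.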


In [29]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['atp' 'cfc' 'gt' 'indvaus' 'cricket' 'bbl02' 'f1' 'gbfedcup' 'ausgp'
 'game']

Largest Coefs: 
['nelsonmandela' 'pm' 'obama' 'medvedev' 'president' 'minister' 'meeting'
 'seckerry' 'bcim2013' 'thank']


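Looking at the coefficients, the vocabulary becomes much more meaningful: the terms with the smallest (most negative) coefficients are sports-related, such as 'cricket' and 'f1', while the terms with the largest coefficients are political, such as 'obama', 'president' and 'minister'.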
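
The slicing used in the cell above can be confusing, so here is the same pattern in plain Python on a hypothetical five-feature model. `sorted(range(...), key=...)` plays the role of `argsort()`.

```python
# Hypothetical features and coefficients, for illustration only
feature_names = ["obama", "the", "cricket", "minister", "game"]
coefs = [2.3, 0.1, -2.0, 1.5, -1.2]

# Indices that would sort the coefficients in ascending order (like argsort)
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

# First two indices: the most negative coefficients
smallest = [feature_names[i] for i in order[:2]]

# [:-3:-1] walks backwards from the end and stops before the third-from-last,
# giving the two largest coefficients, largest first
largest = [feature_names[i] for i in order[:-3:-1]]

print(smallest)
print(largest)
```

With ten features instead of two, the same pattern becomes `[:10]` and `[:-11:-1]`, as in the notebook.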

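Next, let's look at a different approach that rescales the features: tf–idf. Tf–idf, or term frequency-inverse document frequency, weights terms by how important they are to a document. A term that occurs often in one document but rarely across the corpus gets a high weight, while a term that appears in many documents gets a low one.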
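
A plain-Python sketch of the weighting may help. It assumes the smoothed idf formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; the documents are made up, and the helper names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents
docs = [
    "the president met a minister",
    "the match was a great match",
    "the minister watched a match",
]
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 normalisation
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

In the first document, "the" appears in every document and gets the lowest weight, "minister" appears in two documents and sits in the middle, and "president" and "met" appear only here and get the highest weight.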

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

2217

In [31]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9231604486097286




In [32]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['oldham' 'efc' '1st' 'tags' 'table' 'ble' 'emily' 'schmidt' 'suga' 'gone']

Largest tfidf: 
['agree' 'love' 'enjoy' 'astonvilla' 'no' 'embracetherace' 'nice'
 'challenge' 'ff' 'check']


In [33]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['atp' 'cfc' 'indvaus' 'bbl02' 'gt' 'game' 'tennis' 'cricket' 'test' 'f1']

Largest Coefs: 
['nelsonmandela' 'president' 'pm' 'obama' 'medvedev' 'of' 'minister'
 'people' 'meeting' 'eu']


## Model Evaluation

Now let's apply our model to the test data. The test file has no Label column, so we can generate predictions for it but cannot score them.

In [35]:
# Let's read the test data from my local computer
df1=pd.read_csv('C:/Users/TOCHIBA/desktop/test.csv')
df1.head(10)

Unnamed: 0,TweetId,TweetText
0,306486520121012224,'28. The home side threaten again through Maso...
1,286353402605228032,'@mrbrown @aulia Thx for asking. See http://t....
2,289531046037438464,'@Sochi2014 construction along the shores of t...
3,306451661403062273,'#SecKerry\u2019s remarks after meeting with F...
4,297941800658812928,'The #IPLauction has begun. Ricky Ponting is t...
5,305722428531802112,'Viswanathan Anand draws with Fabiano Caruana ...
6,304713516256997377,Have your say on tonight's game - send a text ...
7,234999630725783553,"'The #olympics may be over, but the #paralympi..."
8,303712268372283392,"'@richaanirudh big compliment, thanks!'"
9,304215754130194432,'Espargar\xf3 @PolEspargaro quickest as Jerez ...


In [36]:
print(df1.shape)

(2610, 2)


In [37]:
x_test=df1['TweetText']
print('x_test first entry:\n\n', x_test.iloc[0])

x_test first entry:

 '28. The home side threaten again through Mason Bennett after he gets on the end of a long, long throw and stabs a yard wide.'


In [39]:
# Predict the transformed test documents
yhat = model.predict(vect.transform(x_test))

yhat[:100]

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1])
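
The predictions are easier to read once they are mapped back to label names. A minimal sketch using only the first ten values printed above; the `label_names` mapping is an assumption that follows the earlier definition of `target` (1 for Politics, 0 otherwise).

```python
from collections import Counter

# First ten predictions copied from the output above
yhat_head = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]

# 1 was defined as Politics when the target column was built
label_names = {1: "Politics", 0: "Not Politics"}
predicted_labels = [label_names[y] for y in yhat_head]

print(Counter(predicted_labels))
```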