# Report
## Predict Submission
https://www.kaggle.com/t/cb6ceb3bf96a48819d6b4f0994fb58db
## Features
### Good features for this problem
1. are able to capture the distinctive aspects of someone’s writing style, and 
2. are consistent even when the author is writing on different subjects.

### Features that may work
- Lexical features:
 - The average number of words per sentence
 - Sentence length variation
 - Lexical diversity, which is a measure of the richness of the author’s vocabulary
- Punctuation features:
 - Average number of commas, semicolons and colons per sentence
- Average word length
- Frequency of digits used
- Frequency of letters used
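
The candidate features listed above can be sketched in plain Python. This is a minimal illustration rather than the code used for the submission: the function name `stylometric_features`, the naive regex sentence splitter, and the sample text are all assumptions, and the function expects non-empty text.

```python
import re
from statistics import mean, pstdev

def stylometric_features(text):
    """Return a dict of simple style features for one non-empty text."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # a real sentence tokenizer would be more robust).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    words_per_sentence = [len(s.split()) for s in sentences]
    words = re.findall(r"[A-Za-z']+", text.lower())
    punct = len(re.findall(r'[,;:]', text))
    return {
        'avg_words_per_sentence': mean(words_per_sentence),
        'sentence_length_std': pstdev(words_per_sentence),
        # distinct words divided by total words
        'lexical_diversity': len(set(words)) / len(words),
        'punct_per_sentence': punct / len(sentences),
        'avg_word_length': mean(len(w) for w in words),
        'digit_freq': sum(c.isdigit() for c in text) / len(text),
        'letter_freq': sum(c.isalpha() for c in text) / len(text),
    }

sample = "I came, I saw. I won!"  # made-up sample text
features = stylometric_features(sample)
```

For the sample, the two sentences have 4 and 2 words, so the average is 3 words per sentence, and 4 of the 6 words are distinct.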

## References
Authorship Attribution with Python
http://www.aicbt.com/authorship-attribution/

Ultimate guide to deal with Text Data
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code
https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/

## Supervised Learning
(Needs labels; the ground truth is gathered from an external source.)
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks

## Unsupervised Learning
(Does not need labels; the analysis is conducted without ground truth.)
- Clustering
 - k-Means
 - Hierarchical Cluster Analysis (HCA)
 - Expectation Maximization
- Visualization and dimensionality reduction
 - Principal Component Analysis (PCA)
 - Kernel PCA
 - Locally-Linear Embedding (LLE)
 - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
 - Apriori
 - Eclat
 

---

# Personal Trials
Consider the unsupervised problem. There are three steps:

1. Preparing and loading the data
2. Feature extraction: We will experiment with a few different feature sets. Even though the focus is on the unsupervised problem, the feature extraction code can also be used for supervised learning.
3. Classification: We will use clustering to find natural groupings in the data. Since we have several feature sets, we will use ensemble learning: learn multiple models, each built using different features, that vote to determine who wrote each chapter.
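
The voting idea in step 3 can be sketched as a plurality vote over per-model predictions. This is only an illustration: `majority_vote` and the three toy prediction lists are hypothetical, and it assumes the models already use the same label ids (cluster ids from different clusterings would have to be aligned first).

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per item by plurality vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count;
        # ties go to the label that was seen first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Toy predictions from three hypothetical models for four items
model_a = [0, 1, 1, 0]
model_b = [0, 1, 0, 0]
model_c = [1, 1, 0, 0]
final = majority_vote([model_a, model_b, model_c])
```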

## Import Libraries and Create Global Data

In [35]:
import numpy as np
import pandas as pd
import re
import sklearn
import nltk
import copy
import csv
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer # removes suffixes such as "ing", "ly", "s"
from textblob import TextBlob
from textblob import Word

stop = stopwords.words('english')

## Global Functions

In [2]:
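def avg_word(sentence):
    """Return the mean word length of a whitespace-split sentence."""
    words = sentence.split()
    if not words:  # avoid ZeroDivisionError on empty or whitespace-only text
        return 0.0
    return sum(len(word) for word in words) / len(words)

def readLabelData(path):
    """Read 'id<TAB>tweet' lines into a list of [id, tweet] pairs."""
    res = []
    with open(path, "r", encoding="utf-8") as fo:
        for line in fo:
            # split on the first tab only, in case the tweet itself contains tabs
            user_id, tweet = line.rstrip('\n').split('\t', 1)
            res.append([int(user_id), tweet])
    return res

def readUnlabelData(path):
    """Read one tweet per line into a list of strings."""
    with open(path, "r", encoding="utf-8") as fo:
        return [line.rstrip('\n') for line in fo]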

## Import Data

In [37]:

df = pd.read_csv('data/train_tweets.txt', 
                 encoding="utf-8",
                 header=None, sep='\t',
                quoting=csv.QUOTE_NONE)
# data = readLabelData('data/train_tweets.txt')
# df = pd.DataFrame(data)
df.columns = ['id', 'tweet']
print(df.shape)

(328932, 2)


## Feature Extraction

1. Number of words
2. Number of characters (with or without spaces)
3. Average word length
4. Number of stopwords
5. Number of special characters (hashtags)
6. Number of numerics
7. Number of uppercase words
8. Number of punctuation

In [4]:
def createFeature(df):
    """Add the hand-crafted count features to df in place and return it."""
    df['words'] = df['tweet'].apply(lambda x: len(str(x).split(" ")))
    df['chars'] = df['tweet'].str.len()  # this also includes spaces
    # without spaces: df['tweet'].apply(lambda x: len(str(x).replace(" ", "")))
    df['avg_word'] = df['tweet'].apply(avg_word)
    df['stopwords'] = df['tweet'].apply(lambda x:
                                    len([w for w in x.split()
                                         if w in stop]))
    # the column name 'hastags' (sic) is kept because later cells refer to it
    df['hastags'] = df['tweet'].apply(lambda x:
                               len([w for w in x.split()
                                    if w.startswith('#')]))
    df['numerics'] = df['tweet'].apply(lambda x:
                             len([w for w in x.split()
                                  if w.isdigit()]))
    df['upper'] = df['tweet'].apply(lambda x:
                          len([w for w in x.split()
                               if w.isupper()]))
    # regex=True: since pandas 2.0, str.replace treats the pattern literally by default
    df['punctuation'] = df['tweet'].str.replace(r'[\w\s]', '', regex=True).str.len()
    return df

In [5]:
df['words'] = df['tweet'].apply(lambda x: len(str(x).split(" ")))

In [6]:
df['chars'] = df['tweet'].str.len() ## this also includes spaces
# charNum = df['tweet'].apply(lambda x: 
#                                    len(str(x).replace(" ", "")))

In [7]:
df['avg_word'] = df['tweet'].apply(lambda x: avg_word(x))

In [8]:
df['stopwords'] = df['tweet'].apply(lambda x: 
                                    len([x for x in x.split() 
                                         if x in stop]))

In [9]:
df['hastags'] = df['tweet'].apply(lambda x: 
                               len([x for x in x.split() 
                                    if x.startswith('#')]))

In [10]:
df['numerics'] = df['tweet'].apply(lambda x: 
                             len([x for x in x.split() 
                                  if x.isdigit()]))

In [11]:
df['upper'] = df['tweet'].apply(lambda x: 
                          len([x for x in x.split() 
                               if x.isupper()]))

In [12]:
# regex=True: since pandas 2.0, str.replace treats the pattern literally by default
df['punctuation'] = df['tweet'].str.replace(r'[\w\s]', '', regex=True).str.len()

In [26]:
df.head()

| | id | tweet | words | chars | avg_word | stopwords | hastags | numerics | upper | punctuation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8746 | @handle Let's try and catch up live next week! | 9 | 46 | 4.222222 | 2 | 0 | 0 | 0 | 3 |
| 1 | 8746 | Going to watch Grey's on the big screen - Thur... | 11 | 66 | 5.090909 | 3 | 0 | 0 | 0 | 7 |
| 2 | 8746 | @handle My pleasure Patrick....hope you are well! | 7 | 49 | 6.142857 | 2 | 0 | 0 | 0 | 6 |
| 3 | 8746 | @handle Hi there! Been traveling a lot and lot... | 27 | 132 | 3.925926 | 9 | 0 | 0 | 0 | 6 |
| 4 | 8746 | RT @handle Looking to Drink Clean & Go Green? ... | 19 | 109 | 4.789474 | 4 | 0 | 0 | 1 | 5 |


## Basic Pre-processing (Don't Run)
Clean the data in order to obtain better features.

In [14]:
preProcess = copy.deepcopy(df['tweet'])

In [15]:
# 1. Lower case
preProcess = preProcess.apply(lambda x: 
                                      " ".join(x.lower() 
                                               for x in x.split()))

In [28]:
# 2. Removing Punctuation
preProcess = preProcess.str.replace(r'[^\w\s]', '', regex=True)

In [17]:
# 3. Removal of Stop Words
preProcess = preProcess.apply(lambda x: 
                                " ".join(x for x in x.split() 
                                         if x not in stop))

In [18]:
# 4. Common word removal
NUM_TOP_WORDS = 10
freq = pd.Series(' '.join(preProcess).split()).value_counts()[:NUM_TOP_WORDS]
freq_index = set(freq.index)  # the NUM_TOP_WORDS most frequent words
preProcess = preProcess.apply(lambda x: 
                                      " ".join(w for w in x.split() 
                                               if w not in freq_index))

In [19]:
# 5. Rare words removal
NUM_TAIL_WORDS = -10
freq = pd.Series(' '.join(preProcess).split()).value_counts()[NUM_TAIL_WORDS:]
freq_index = set(freq.index)  # the 10 least frequent words
preProcess = preProcess.apply(lambda x: 
                                " ".join(w for w in x.split() 
                                         if w not in freq_index))

In [20]:
# 6. Spelling correction (takes a lot of time)
# preProcess.apply(lambda x: str(TextBlob(x).correct()))

In [21]:
# 7. Tokenization
# TextBlob(preProcess[1]).words

In [22]:
# 8. Stemming
# st = PorterStemmer()
# preProcess.apply(lambda x: 
#                  " ".join([st.stem(word) 
#                            for word in x.split()]))

In [23]:
# 9. Lemmatization
# lemmatization is usually preferred over stemming
preProcess = preProcess.apply(lambda x: 
                              " ".join([Word(word).lemmatize() 
                                        for word in x.split()]))

## Classification

In [14]:
features = ['words','chars','avg_word','stopwords','hastags','numerics','upper','punctuation']
df[['id'] + features].head()

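| | id | words | chars | avg_word | stopwords | hastags | numerics | upper | punctuation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8746 | 9 | 46 | 4.222222 | 2 | 0 | 0 | 0 | 3 |
| 1 | 8746 | 11 | 66 | 5.090909 | 3 | 0 | 0 | 0 | 7 |
| 2 | 8746 | 7 | 49 | 6.142857 | 2 | 0 | 0 | 0 | 6 |
| 3 | 8746 | 27 | 132 | 3.925926 | 9 | 0 | 0 | 0 | 6 |
| 4 | 8746 | 19 | 109 | 4.789474 | 4 | 0 | 0 | 1 | 5 |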


In [30]:
df[features].describe()

| | words | chars | avg_word | stopwords | hastags | numerics | upper | punctuation |
|---|---|---|---|---|---|---|---|---|
| count | 328932.0 | 328932.0 | 328932.0 | 328932.0 | 328932.0 | 328932.0 | 328932.0 | 328932.0 |
| mean | 13.708602 | 84.229306 | 5.600385 | 3.657908 | 0.154175 | 0.119003 | 0.77126 | 6.354721 |
| std | 6.657503 | 37.117228 | 2.445895 | 3.118349 | 0.556568 | 0.396859 | 1.538788 | 4.18556 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8.0 | 55.0 | 4.4 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 50% | 13.0 | 85.0 | 5.142857 | 3.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 75% | 19.0 | 118.0 | 6.25 | 6.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 38.0 | 150.0 | 140.0 | 24.0 | 17.0 | 14.0 | 31.0 | 128.0 |


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score 

x_train, x_test, y_train, y_test = train_test_split(df[features], 
                                                    df.id, 
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=4) 

### 1. Decision Tree

In [16]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)
clf.feature_importances_

array([0.05175706, 0.24713924, 0.25608045, 0.15293236, 0.02665349,
       0.01810787, 0.07846067, 0.16886885])

In [17]:
predict_train = clf.predict(x_train)
predict_test = clf.predict(x_test)
print('f1 for train = ' , f1_score(y_train, predict_train, average='micro'))
print('f1 for test = ' , f1_score(y_test, predict_test, average='micro'))

f1 for train =  0.5889277727859261
f1 for test =  0.04651304189213838


In [19]:
# train the whole data
# clf.fit(df[features], df.id)

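### 2. Bayesian Ridge Regression (performs very poorly)
`linear_model.BayesianRidge` is a Bayesian linear regression model rather than a Naive Bayes classifier. Rounding its continuous output rarely lands on a correct author id, which is consistent with the near-zero scores below.

In [22]:
from sklearn import linear_model
clf_bayes = linear_model.BayesianRidge()
clf_bayes.fit(x_train, y_train)
predict_train = clf_bayes.predict(x_train).round()  # round to the nearest integer
predict_test = clf_bayes.predict(x_test).round()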

print('f1 for train = ' , f1_score(y_train, predict_train, average='micro'))
print('f1 for test = ' , f1_score(y_test, predict_test, average='micro'))

f1 for train =  0.0003276606381613171
f1 for test =  0.00024320544780203075


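### 3. K-Means (even worse)
K-Means cluster indices (0 to `num_clusters - 1`) are arbitrary and are not author ids, so scoring them directly against `id` is not a meaningful evaluation.

In [23]:
from sklearn.cluster import KMeans
# use K-Means with default settings
num_clusters = 4
kmeans_clf = KMeans(n_clusters=num_clusters)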
kmeans_clf.fit(x_train)  

predict_train = kmeans_clf.predict(x_train)
predict_test = kmeans_clf.predict(x_test)
print('f1 for train = ' , f1_score(y_train, predict_train, average='micro'))
print('f1 for test = ' , f1_score(y_test, predict_test, average='micro'))

[2 3 2 ... 3 0 1]
f1 for train =  2.3645613063187833e-05
f1 for test =  0.00012160272390101538
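
One reason the K-Means scores above are near zero is that cluster indices are compared directly with author ids. A common workaround, sketched here under assumptions, is to map each cluster to the most frequent true label among its training members before scoring. The helper `map_clusters_to_labels` and all toy values below are hypothetical and were not part of the original experiment.

```python
from collections import Counter, defaultdict

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    return {c: Counter(labels).most_common(1)[0][0]
            for c, labels in members.items()}

# Toy data: cluster assignments and made-up author ids for six tweets
train_clusters = [2, 2, 0, 0, 0, 1]
train_authors = [8746, 8746, 1105, 1105, 8746, 4321]
mapping = map_clusters_to_labels(train_clusters, train_authors)
# Translate cluster ids into author ids so they can be scored against labels
mapped = [mapping[c] for c in train_clusters]
```

Even with such a mapping, `num_clusters = 4` means at most four distinct authors can ever be predicted, so the number of clusters would also need to grow toward the number of authors.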


### 4. KNN

In [32]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train,y_train)
predict_train = knn_clf.predict(x_train)
predict_test = knn_clf.predict(x_test)
print('f1 for train = ' , f1_score(y_train, predict_train, average='micro'))
print('f1 for test = ' , f1_score(y_test, predict_test, average='micro'))

f1 for train =  0.20923665205142583
f1 for test =  0.035477594698121236


---

# Creating the Submission

In [39]:
unLabel = pd.read_csv('data/test_tweets_unlabeled.txt', 
                      header=None,
                      sep='\t', 
                      quoting=csv.QUOTE_NONE)
# data = readUnlabelData('data/test_tweets_unlabeled.txt')
# unLabel = pd.DataFrame(data)
unLabel.columns = ['tweet']
unLabel.shape

(35437, 1)

In [40]:
createFeature(unLabel)
predict_answer = clf.predict(unLabel[features])

In [41]:
unLabel.head()

| | tweet | words | chars | avg_word | stopwords | hastags | numerics | upper | punctuation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Some people say that rappers don’t have feelin... | 23 | 133 | 4.826087 | 8 | 0 | 0 | 0 | 5 |
| 1 | Do you know how to tweet on a Blackberry 8830?... | 15 | 92 | 5.2 | 6 | 0 | 0 | 0 | 3 |
| 2 | "Yoga is the cessation of mind." -Patanjali | 7 | 43 | 5.285714 | 3 | 0 | 0 | 0 | 4 |
| 3 | @handle Well, with my millions of dollars, a f... | 18 | 99 | 4.555556 | 9 | 0 | 0 | 0 | 7 |
| 4 | Cambria hotels free guide http://hotels.izigot... | 13 | 118 | 8.153846 | 0 | 0 | 1 | 0 | 9 |


In [38]:
with open("submission.txt", "w") as f:
    f.write('Id,Predicted\n')
    # ids in the submission file are 1-based
    for index, prediction in enumerate(predict_answer, start=1):
        f.write(f'{index},{prediction}\n')

---

# Method Trials


## NLTK

In [None]:
import nltk

In [None]:
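# count word frequencies
# `tokens` must be a list of tokens; tokenizing the first tweet is an illustrative assumption
tokens = WordPunctTokenizer().tokenize(df['tweet'][0])
nltk.FreqDist(tokens)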

## Re

In [None]:
import re
s1 = "asad a a sdas da as das "
# match the leading run of non-whitespace characters; matchObj.group() is 'asad'
matchObj = re.match(r'\S*', s1)

## N-grams

In [None]:
TextBlob(preProcess[0]).ngrams(2)
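
As a rough illustration of what an n-gram extractor computes, the sliding window can be written in plain Python. The helper `ngrams` and the sample sentence are assumptions for this sketch and are not TextBlob's implementation.

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "going to watch the big screen".split()
bigrams = ngrams(tokens, 2)
```

A sequence of 6 tokens yields 5 bigrams, from `('going', 'to')` through `('big', 'screen')`.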

## TF-IDF

### Term frequency
TF = (Number of times term T appears in the particular row) / (number of terms in that row)
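
A small worked example of the TF formula, using a made-up row and a hypothetical helper `term_frequency`:

```python
from collections import Counter

def term_frequency(row):
    """TF per term: occurrences in the row divided by the row's term count."""
    terms = row.split()
    counts = Counter(terms)
    return {term: count / len(terms) for term, count in counts.items()}

tf = term_frequency("lot and lot of travel")
```

Here "lot" appears 2 times among 5 terms, so its TF is 0.4, and every other term has TF 0.2. Note that the cell that follows sums raw counts over the first five rows, so its `tf` column holds counts rather than this ratio.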

In [None]:
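# pd.Series(...).value_counts() replaces pd.value_counts, which is deprecated in recent pandas
tf1 = (preProcess[0:5]).apply(lambda x: 
                              pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']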

### Inverse Document Frequency
IDF = log(N / n), where N is the total number of rows and n is the number of rows in which the term appears.
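
A small worked example of the IDF formula on four made-up rows. The helper `inverse_document_frequency` is hypothetical, matches whole words only, and assumes the term occurs in at least one row.

```python
import math

def inverse_document_frequency(term, rows):
    """IDF = log(N / n): N rows in total, n rows containing the term."""
    n = sum(1 for row in rows if term in row.split())
    return math.log(len(rows) / n)

rows = ["big screen", "big lot", "clean green", "drink clean green"]
idf_big = inverse_document_frequency("big", rows)      # log(4 / 2)
idf_drink = inverse_document_frequency("drink", rows)  # log(4 / 1)
```

The rarer term "drink" gets the higher IDF, which is what lets TF-IDF down-weight words that appear in many rows.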

In [None]:
for i, word in enumerate(tf1['words']):
    # regex=False treats the word as a literal substring rather than a pattern
    tf1.loc[i, 'idf'] = np.log(preProcess.shape[0] / preProcess.str.contains(word, regex=False).sum())

In [None]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']

### sklearn has a separate function to directly obtain it:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
 stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(preProcess)

## Bag of Words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(preProcess)
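
Conceptually, a bag-of-words model builds a vocabulary and then counts each vocabulary term per document. The plain-Python sketch below shows that idea on two made-up documents. The helper `bag_of_words` is hypothetical and simplifies tokenization to lowercasing and whitespace splitting, whereas `CountVectorizer` by default also drops punctuation and single-character tokens.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # one count per vocabulary term, in vocabulary order
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["big big screen", "clean green screen"])
```

Each row of `vectors` has one entry per vocabulary term, so documents of different lengths become fixed-length numeric features.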

## Word2Vec
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

## TextBlob
https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/