# Report
## Features
### Good features for this problem
Good features:
1. capture the distinctive aspects of someone’s writing style, and
2. stay consistent even when the author is writing on different subjects.

### Features that may work
- Lexical features:
 - The average number of words per sentence
 - Sentence length variation
 - Lexical diversity, which is a measure of the richness of the author’s vocabulary
- Punctuation features:
 - Average number of commas, semicolons and colons per sentence
- Average word length
- Frequency of digits
- Frequency of letters
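
As a rough illustration of the features listed above, here is a minimal sketch that computes them for one piece of text using only the standard library. The function name `style_features`, the naive sentence splitter, and the word pattern are all assumptions made for this example; a real pipeline would probably use a proper tokenizer such as the ones in NLTK.

```python
import re
from statistics import mean, pstdev

WORD_RE = re.compile(r"[A-Za-z']+")  # assumed word pattern (ASCII letters and apostrophes)


def style_features(text):
    """Compute a few stylometric features for one text (illustrative only)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words_per_sentence = [len(WORD_RE.findall(s)) for s in sentences]
    words = [w.lower() for w in WORD_RE.findall(text)]
    n_sent = max(len(sentences), 1)  # avoid dividing by zero on empty text
    n_chars = max(len(text), 1)
    return {
        "avg_words_per_sentence": mean(words_per_sentence) if words_per_sentence else 0.0,
        "sentence_length_std": pstdev(words_per_sentence) if words_per_sentence else 0.0,
        # Lexical diversity as the type/token ratio: unique words / total words.
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "commas_per_sentence": text.count(",") / n_sent,
        "semicolons_per_sentence": text.count(";") / n_sent,
        "colons_per_sentence": text.count(":") / n_sent,
        "avg_word_length": mean([len(w) for w in words]) if words else 0.0,
        "digit_freq": sum(c.isdigit() for c in text) / n_chars,
        "letter_freq": sum(c.isalpha() for c in text) / n_chars,
    }


sample = "I came, I saw. I conquered!"
features = style_features(sample)
```

For the sample sentence pair, the two sentences contain 4 and 2 words, so the average is 3 words per sentence with a standard deviation of 1.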

## References
Authorship Attribution with Python
http://www.aicbt.com/authorship-attribution/

Ultimate guide to deal with Text Data
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code
https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/

## Supervised Learning
(Needs labels; the ground truth is gathered from an external source.)
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks

## Unsupervised Learning
(Does not need labels; the analysis is conducted without ground truth.)
- Clustering
 - k-Means
 - Hierarchical Cluster Analysis (HCA)
 - Expectation Maximization
- Visualization and dimensionality reduction
 - Principal Component Analysis (PCA)
 - Kernel PCA
 - Locally-Linear Embedding (LLE)
 - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
 - Apriori
 - Eclat

---

# Personal trials
Consider the unsupervised problem. There are three steps:

1. Preparing and loading the data
2. Feature extraction: We will experiment with a few different feature sets. Even though the focus is on the unsupervised problem, the feature extraction code can also be used for supervised learning.
3. Classification: We will use clustering to find natural groupings in the data. Since we have several feature sets, we will use ensemble learning: learn multiple models, each built using different features, that vote to determine who wrote each chapter.
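
The voting idea in step 3 can be sketched as a simple majority vote over the labels produced by each model. This is only an illustration under a strong assumption: cluster ids from different models are arbitrary, so the sketch assumes they have already been aligned so that the same id means the same author in every model. The function name `majority_vote` and the toy label lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per document.

    predictions: list of equal-length label lists, one list per model.
    Assumes cluster ids were aligned across models beforehand
    (a hypothetical step that is not shown here).
    """
    combined = []
    for labels in zip(*predictions):  # labels = every model's vote for one document
        winner, _count = Counter(labels).most_common(1)[0]
        combined.append(winner)
    return combined


# Toy labels from three models, each trained on a different feature set.
model_a = [0, 0, 1, 1]
model_b = [0, 1, 1, 1]
model_c = [0, 0, 0, 1]
votes = majority_vote([model_a, model_b, model_c])
```

With an odd number of models and two clusters there are no ties; with more clusters a tie-breaking rule would need to be chosen.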

## Import libraries and create global data

In [11]:
import numpy as np
import pandas as pd
import re
import sklearn
import nltk
import copy
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer  # removes suffixes such as "ing", "ly", "s"
from textblob import TextBlob

stop = stopwords.words('english')

## Global function

In [4]:
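def avg_word(sentence):
  """Return the mean word length of a whitespace-split sentence."""
  words = str(sentence).split()
  if not words:  # avoid dividing by zero on an empty tweet
    return 0.0
  return sum(len(word) for word in words) / len(words)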

## Import data

In [8]:
# DataFrame
df = pd.read_csv('data/train_tweets.txt', header=None, sep='\t')
df.columns = ['id', 'tweet']
df.head()

| | id | tweet |
|---|---|---|
| 0 | 8746 | @handle Let's try and catch up live next week! |
| 1 | 8746 | Going to watch Grey's on the big screen - Thur... |
| 2 | 8746 | @handle My pleasure Patrick....hope you are well! |
| 3 | 8746 | @handle Hi there! Been traveling a lot and lot... |
| 4 | 8746 | RT @handle Looking to Drink Clean & Go Green? ... |


## Feature Extraction

1. Number of words
2. Number of characters (with or without spaces)
3. Average Word Length
4. Number of stopwords
5. Number of special characters (hashtags, in the code below)
6. Number of numeric tokens
7. Number of uppercase words

In [6]:
word_count = df['tweet'].apply(lambda x: len(str(x).split(" ")))
char_count = df['tweet'].str.len()  # this also includes spaces
# char_count_no_spaces = df['tweet'].apply(
#     lambda x: len(str(x).replace(" ", "")))
# Result names must not reuse avg_word (the function) or stopwords (the NLTK import).
avg_word_len = df['tweet'].apply(avg_word)
stopword_count = df['tweet'].apply(
    lambda x: len([w for w in x.split() if w in stop]))
hashtag_count = df['tweet'].apply(
    lambda x: len([w for w in x.split() if w.startswith('#')]))
numeric_count = df['tweet'].apply(
    lambda x: len([w for w in x.split() if w.isdigit()]))
upper_count = df['tweet'].apply(
    lambda x: len([w for w in x.split() if w.isupper()]))


## Basic Pre-processing
Clean the data to obtain better features.

In [10]:
preProcess = copy.deepcopy(df['tweet'])
# Lower case
preProcess = preProcess.apply(
    lambda x: " ".join(w.lower() for w in x.split()))
# Removing punctuation (pandas 2.0+ treats the pattern literally unless regex=True)
preProcess = preProcess.str.replace(r'[^\w\s]', '', regex=True)
# Removal of stop words
preProcess = preProcess.apply(
    lambda x: " ".join(w for w in x.split() if w not in stop))
# Common word removal: drop the 10 most frequent words
freq = pd.Series(' '.join(preProcess).split()).value_counts()[:10]
freq_index = set(freq.index)
preProcess = preProcess.apply(
    lambda x: " ".join(w for w in x.split() if w not in freq_index))
# Rare word removal: drop the 10 least frequent words
freq = pd.Series(' '.join(preProcess).split()).value_counts()[-10:]
freq_index = set(freq.index)
preProcess = preProcess.apply(
    lambda x: " ".join(w for w in x.split() if w not in freq_index))
# Spelling correction (takes a lot of time)
# preProcess = preProcess.apply(lambda x: str(TextBlob(x).correct()))

# Tokenization
# TextBlob(preProcess[1]).words

WordList(['going', 'watch', 'greys', 'big', 'screen', 'thursday', 'indulgence'])

---

# Method Trials

## NLTK

In [2]:
import nltk

In [None]:
# Count word frequencies
nltk.FreqDist(tokens)
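
The cell above was not executed and `tokens` is not defined anywhere in these notes. As a minimal sketch of the same idea, the example below counts word frequencies with `collections.Counter`, which `nltk.FreqDist` is built on. The token list here is a made-up example produced by a plain whitespace split, not data from the tweet file.

```python
from collections import Counter

# Hypothetical tokens: a whitespace split of a made-up, already lower-cased string.
tokens = "going to watch greys on the big screen going big".split()

freq = Counter(tokens)     # maps each token to its number of occurrences
top = freq.most_common(2)  # the two most frequent tokens with their counts
```

Passing the same `tokens` list to `nltk.FreqDist` should give the same counts, plus NLTK's extra helpers.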

## Re

In [20]:
import re
s1 = "asad a a sdas da as das "
matchObj = re.match(r'\S*', s1)  # leading run of non-whitespace characters