## 1. Textual feature -- Label Classification

### Multinomial NB

* Why multinomial NB?  
    - Because it is fast and efficient, and it is a common first choice for text classification.

In [None]:
# load packages needed
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
# split the data into train set and test set
trans_train, trans_test, label_train, label_test = train_test_split(ted_clean.transcript, ted_clean.label, 
                                                                    test_size=0.2, random_state=0)

In [None]:
NBmodel = make_pipeline(TfidfVectorizer(max_features=1500, stop_words='english'),
                        MultinomialNB())
NBmodel.fit(trans_train, label_train)


accuracy_score(label_test, NBmodel.predict(trans_test))

0.6979166666666666

In [None]:
# change max_features to 3000
NBmodel = make_pipeline(TfidfVectorizer(max_features=3000, stop_words='english'),
                        MultinomialNB())
NBmodel.fit(trans_train, label_train)


accuracy_score(label_test, NBmodel.predict(trans_test))

0.703125

In [None]:
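# plot the confusion matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# confusion_matrix needs the predicted labels, not the accuracy score
label_pred = NBmodel.predict(trans_test)

mat = confusion_matrix(label_test, label_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cmap="Reds")
# fmt: format of the annotations; 'd' = integer, 's' = string
# square: make each cell square
# rows of `mat` are the true labels, columns are the predicted labels
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()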

* Going from 1500 to 3000 features gives only a very slight increase in accuracy (from 0.698 to 0.703).
* How about 5000 features?

In [None]:
# change max_features to 5000 to see if there is any difference
NBmodel = make_pipeline(TfidfVectorizer(max_features=5000, stop_words='english'),
                        MultinomialNB())
NBmodel.fit(trans_train, label_train)


accuracy_score(label_test, NBmodel.predict(trans_test))

0.6927083333333334

* With 5000 features the accuracy drops to 0.693, which is even lower than with 1500 features (0.698).

* So basically, the NB model suggests that it is not possible to predict the rate of positive ratings of a talk from its transcript.
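
The three accuracies above come from a single train/test split, so differences this small may be noise. A minimal sketch of how the same `max_features` comparison could be repeated with cross-validation; the toy corpus, the toy labels and the feature counts are made up for illustration and are not the TED data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the transcripts
toy_texts = ["science brain research data"] * 6 + ["music art story design"] * 6
toy_labels = [0] * 6 + [1] * 6

cv_scores = {}
for n_features in (2, 4, 8):
    model = make_pipeline(
        TfidfVectorizer(max_features=n_features, stop_words="english"),
        MultinomialNB(),
    )
    # mean accuracy over 3 stratified folds instead of one fixed split
    cv_scores[n_features] = cross_val_score(model, toy_texts, toy_labels, cv=3).mean()

print(cv_scores)
```

On the real data, `toy_texts` and `toy_labels` would be replaced by `ted_clean.transcript` and `ted_clean.label`, and the feature counts by 1500, 3000 and 5000.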

### SVM

* Why SVM?  
    - Because I am trying to predict a categorical variable, which is exactly what an SVM classifier is for. (I also wanted to try a model a little more complex than the NB model.)

In [None]:
# calculate tf-idf before feeding the raw data into the SVM
vectorz = TfidfVectorizer(max_df=0.5, min_df=2, max_features=1500, stop_words='english')

tf_train = vectorz.fit_transform(trans_train)
tf_test = vectorz.transform(trans_test)

In [None]:
svcmdl = SVC(kernel='linear', C=1E5)
svcmdl.fit(tf_train, label_train)

SVC(C=100000.0, kernel='linear')

In [None]:
accuracy_score(label_test, svcmdl.predict(tf_test))

0.6302083333333334

* This is even lower than the baseline accuracy.
* We can conclude that it is not possible to predict the popularity of the talks from their transcripts.
* There seems to be no difference in content between the popular talks and the less popular ones.
* This is to be expected, considering that the ratings themselves are strongly skewed toward positive ratings (mean positive rate: 91%).
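
The baseline accuracy is not computed in this section. A minimal sketch of one way to get it is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The toy labels below are an assumption and not the real `label` column:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 1 is the majority class
toy_label_train = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
toy_label_test = np.array([1, 1, 1, 0, 1])

# The dummy model ignores the features, so placeholder zeros are enough
toy_X_train = np.zeros((len(toy_label_train), 1))
toy_X_test = np.zeros((len(toy_label_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_label_train)

# accuracy of always predicting the majority class
baseline_acc = accuracy_score(toy_label_test, baseline.predict(toy_X_test))
print(baseline_acc)
```

With the real data, the labels would be `label_train` and `label_test`, and the result is the number the NB and SVM accuracies should be compared against.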

## 2. K-band & `obnoxious`

### Feature Engineering Part

* Sub-hypothesis to check: the higher the k-band, the higher the percentage of `obnoxious` ratings

In [None]:
# make a toy data set 
toy_df = ted_clean.head(10)

In [None]:
# calculate the fraction of 'Obnoxious' ratings for each talk in toy_df
obs_perc = []

for i in toy_df.ratings_tuple:
    total = 0
    obn_score = 0
    for (x, y) in i:
        # every rating counts toward the total
        total += int(y)
        if x == 'Obnoxious':
            obn_score += int(y)
    obs_perc.append(obn_score/total)

In [None]:
# check if it works
obs_perc

[0.0022269579115610015,
 0.04461852861035422,
 0.05028328611898017,
 0.009388412017167383,
 0.002380952380952381,
 0.021815576973170096,
 0.048349449816605536,
 0.0048828125,
 0.021033958438925495,
 0.044553860934310074]

In [None]:
# run the same calculation on the original data set, this time as a percentage
obs_perc = []

for i in ted_clean.ratings_tuple:
    total = 0
    obn_score = 0
    for (x, y) in i:
        # every rating counts toward the total
        total += int(y)
        if x == 'Obnoxious':
            obn_score += int(y)
    obs_perc.append((obn_score/total)*100)

In [None]:
# add a column with the percentage of `obnoxious` ratings
ted_clean['obs_perc'] = obs_perc

In [None]:
# look at the distribution of the new column
ted_clean.obs_perc.describe()

count    957.000000
mean       1.523199
std        2.328065
min        0.000000
25%        0.367366
50%        0.844206
75%        1.746324
max       36.014819
Name: obs_perc, dtype: float64
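
The same loop is written out separately for each rating in this notebook. A minimal sketch of a reusable helper, assuming each entry of `ratings_tuple` is an iterable of `(name, count)` pairs as in the loops above; the function name `rating_percentage` and the toy ratings are hypothetical:

```python
def rating_percentage(ratings, rating_name):
    """Return the share (in percent) of `rating_name` among all rating counts."""
    total = sum(int(count) for _, count in ratings)
    target = sum(int(count) for name, count in ratings if name == rating_name)
    return target / total * 100


# Hypothetical ratings for a single talk
toy_ratings = [("Funny", "50"), ("Obnoxious", "5"), ("Longwinded", "45")]
obnoxious_share = rating_percentage(toy_ratings, "Obnoxious")
print(obnoxious_share)
```

On the data frame this could be applied as `ted_clean.ratings_tuple.map(lambda r: rating_percentage(r, 'Obnoxious'))`. Like the loops above, it raises a `ZeroDivisionError` for a talk with no ratings at all.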

### Regression

In [None]:
# split the data set into train and test set
kband_train, kband_test, obs_train, obs_test = train_test_split(ted_clean[['kband_ave']], ted_clean.obs_perc,
                                                                test_size = 1/5, random_state=0)

In [None]:
# import linear regression and run the model
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(kband_train, obs_train)

LinearRegression()

In [None]:
# get the coefficient: the slope of the fitted line (not the correlation itself)
regressor.coef_

array([0.1122869])

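* The slope is small: one additional k-band unit goes with only about 0.11 percentage points more `obnoxious` ratings, so the relationship between k-band and the rating `obnoxious` looks weak.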
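
Note that `coef_` is a slope, and its size depends on the units of the two variables, so it does not measure the strength of a correlation on its own. A minimal sketch on synthetic data (not the TED data) in which the slope is small but the Pearson correlation is close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor and a target that follows it closely with a small slope
x = rng.normal(size=200)
y = 0.1 * x + rng.normal(scale=0.01, size=200)

slope = np.polyfit(x, y, 1)[0]       # slope of the least-squares line
pearson_r = np.corrcoef(x, y)[0, 1]  # strength of the linear relationship

print(slope, pearson_r)
```

For the data in this notebook, the equivalent check would be `np.corrcoef(ted_clean.kband_ave, ted_clean.obs_perc)[0, 1]`.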

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(obs_test, regressor.predict(kband_test))

1.241665956367778

* On average, this regression model is off by 1.24 percentage points, which is large compared with the mean `obs_perc` of 1.52.
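
A mean absolute error is easier to judge next to a trivial reference model. A minimal sketch, on synthetic data that is only an assumption, comparing `LinearRegression` with a `DummyRegressor` that always predicts the training mean:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical predictor and a skewed target that is unrelated to it
toy_X = rng.uniform(1, 10, size=(100, 1))
toy_y = rng.exponential(scale=1.5, size=100)

X_train, X_test = toy_X[:80], toy_X[80:]
y_train, y_test = toy_y[:80], toy_y[80:]

linear = LinearRegression().fit(X_train, y_train)
mean_only = DummyRegressor(strategy="mean").fit(X_train, y_train)

mae_linear = mean_absolute_error(y_test, linear.predict(X_test))
mae_mean = mean_absolute_error(y_test, mean_only.predict(X_test))

print(mae_linear, mae_mean)
```

If the two errors are about the same, the predictor adds little beyond what the mean of the target already gives.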

## 3. Sentence length and `longwinded`

* Another sub-hypothesis: is the sentence length related to the rating `longwinded`?

### Feature Engineering Part

In [None]:
# calculate the percentage of `Longwinded` ratings for each talk
long_perc = []

for i in ted_clean.ratings_tuple:
    total = 0
    long_score = 0
    for (x, y) in i:
        # every rating counts toward the total
        total += int(y)
        if x == 'Longwinded':
            long_score += int(y)
    long_perc.append((long_score/total)*100)

In [None]:
# insert the list into the data frame and check
ted_clean['long_perc'] = long_perc
ted_clean.head(4)

In [None]:
# count the sentences in each talk (sent_len is the number of sentences, not their length)
import nltk

ted_clean['sent_len'] = ted_clean.transcript.map(lambda x: len(nltk.sent_tokenize(x)))
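
`sent_len` counts how many sentences a talk has. If the quantity of interest is how long the sentences are, an average number of words per sentence could be computed instead. A minimal sketch, using a simple regular-expression split as a stand-in for `nltk.sent_tokenize`; the function name `mean_sentence_length` and the toy text are hypothetical:

```python
import re


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    # split after ., ! or ? followed by whitespace (a rough approximation)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return sum(len(s.split()) for s in sentences) / len(sentences)


toy_text = "This is short. This one is a little bit longer!"
avg_len = mean_sentence_length(toy_text)
print(avg_len)
```

The function expects non-empty text; an empty transcript would raise a `ZeroDivisionError`.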

### Regression

In [None]:
# split the data set into train and test data
sent_train, sent_test, long_train, long_test = train_test_split(ted_clean[['sent_len']], ted_clean.long_perc,
                                                                test_size=0.2, random_state=2)

In [None]:
# run the linear regression
regressor = LinearRegression()
regressor.fit(sent_train, long_train)

LinearRegression()

In [None]:
# get the coefficient: the slope of the fitted line (not the correlation itself)
regressor.coef_

array([0.00919571])

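* Very low: each additional sentence goes with only about 0.009 percentage points more `longwinded` ratings, so there seems to be no meaningful relationship between the number of sentences in a talk and the chance of it being rated `longwinded`.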