# 1. References

Title: Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.

Link: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a


# 2. Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import nltk
# nltk.download()
from nltk.stem.snowball import SnowballStemmer

# 3. Data cleaning

In [2]:
DATA_DIR = "../../data/processed/"
INPUT_FILE_NAME = 'cleaned_squashed15_final.parquet'

In [3]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0,speaker,headline,description,duration,tags,transcript,WC,clean_transcript,clean_transcript_string,sim_tags,squash15_tags
0,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0,"[thank, chris, truly, great, honor, opportunit...",thank chris truly great honor opportunity come...,"cars,solar system,energy,culture,politics,scie...","culture,politics,science,global issues,technology"
1,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0,"[term, invention, like, tell, tale, favorite, ...",term invention like tell tale favorite project...,"macarthur grant,simplicity,design,solar system...","design,global issues"
2,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0,"[public, dewey, long, ago, observe, constitute...",public dewey long ago observe constitute discu...,"corruption,inequality,science,investment,war,c...","science,culture,politics,global issues,business"
3,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0,"[want, start, say, houston, problem, enter, se...",want start say houston problem enter second ge...,"flight,design,nasa,science,invention,entrepren...","design,science,business"
4,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0,"[want, talk, background, idea, car, art, actua...",want talk background idea car art actually mea...,"cars,design,transportation,invention,technolog...","design,technology,business,science"


In [4]:
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2313 entries, 0 to 2312
Data columns (total 11 columns):
speaker                    2313 non-null object
headline                   2313 non-null object
description                2313 non-null object
duration                   2313 non-null object
tags                       2313 non-null object
transcript                 2313 non-null object
WC                         2313 non-null float64
clean_transcript           2313 non-null object
clean_transcript_string    2313 non-null object
sim_tags                   2313 non-null object
squash15_tags              2313 non-null object
dtypes: float64(1), object(10)
memory usage: 198.9+ KB


In [5]:
df = df.dropna(subset=['clean_transcript_string'])
df = df.reset_index(drop=True)
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2313 entries, 0 to 2312
Data columns (total 11 columns):
speaker                    2313 non-null object
headline                   2313 non-null object
description                2313 non-null object
duration                   2313 non-null object
tags                       2313 non-null object
transcript                 2313 non-null object
WC                         2313 non-null float64
clean_transcript           2313 non-null object
clean_transcript_string    2313 non-null object
sim_tags                   2313 non-null object
squash15_tags              2313 non-null object
dtypes: float64(1), object(10)
memory usage: 198.9+ KB


In [6]:
df['clean_transcript_string'][0]

'thank chris truly great honor opportunity come stage twice extremely grateful blow away conference want thank nice comment night sincerely partly sob need position fly air force year shoe boot airplane tell quick story illustrate like true story bite true soon tipper leave sob white house drive home nashville little farm mile east nashville drive know sound like little thing look rearview mirror sudden hit motorcade hear phantom limb pain rent ford taurus dinnertime start look place eat get exit lebanon tennessee get exit shoneys restaurant lowcost family restaurant chain know go sit booth waitress come big commotion tipper take order go couple booth low voice strain hear say say yes vice president al gore wife tipper man say come long way kind series epiphany day continue totally true story get gv fly africa speech nigeria city lagos topic energy begin speech tell story happen day nashville tell pretty way share tipper drive shoneys lowcost family restaurant chain man say laugh give 

In [None]:
# '''
# 1. Numbers
# 2. Apostrophe
# 3. All punctuations
# 4. Weird symbols
# 5. Stop words
# 6. lemmatization
# '''

# import string
# from nltk.corpus import stopwords
# from nltk.tokenize import ToktokTokenizer
# from sklearn.feature_extraction import stop_words
# from nltk.stem.wordnet import WordNetLemmatizer
# sets=[stop_words.ENGLISH_STOP_WORDS]
# sklearnStopWords = [list(x) for x in sets][0]
# token=ToktokTokenizer()
# lemma=WordNetLemmatizer()
# stopWordList=stopwords.words('english')
# stopWords = stopWordList + sklearnStopWords
# stopWords = list(dict.fromkeys(stopWords))


# def stopWordsRemove(text):
#     wordList=[x.lower().strip() for x in token.tokenize(text)]
#     removedList=[x + ' ' for x in wordList if not x in stopWords]
#     text=''.join(removedList)
#     return text


# def lemitizeWords(text):
#     words=token.tokenize(text)
#     listLemma=[]
#     for w in words:
#         x=lemma.lemmatize(w,'v')
#         listLemma.append(x)
#     # Return the lemmatized tokens (returning `text` would discard them)
#     return ' '.join(listLemma)


# # There is a misspelled word that needs to be replaced
# df['transcript'] = df['transcript'].str.replace('childrn','children')

# df['transcript'] = df['transcript'].str.replace('\r',' ')
# df['transcript'] = df['transcript'].str.replace("\'s"," is")
# df['transcript'] = df['transcript'].str.replace("\'m"," am")
# df['transcript'] = df['transcript'].str.replace("\'ll"," will")
# df['transcript'] = df['transcript'].str.replace("Can\'t","cannot")
# df['transcript'] = df['transcript'].str.replace("Shan\'t","shall not")
# df['transcript'] = df['transcript'].str.replace("Won\'t","will not")
# df['transcript'] = df['transcript'].str.replace("n\'t"," not")
# df['transcript'] = df['transcript'].str.replace("\'ve"," have")
# df['transcript'] = df['transcript'].str.replace("\'re"," are")
# df['transcript'] = df['transcript'].str.replace("\'d"," would")
# df['transcript'] = df['transcript'].str.replace(r"\(([^)]+)\)","")
# # Deal with Mr. and Dr.
# df['transcript'] = df['transcript'].str.replace("mr. ","mr")
# df['transcript'] = df['transcript'].str.replace("Mr. ","mr")
# df['transcript'] = df['transcript'].str.replace("dr. ","dr")
# df['transcript'] = df['transcript'].str.replace("mrs. ","mrs")
# df['transcript'] = df['transcript'].str.replace("Mrs. ","mrs")
# df['transcript'] = df['transcript'].str.replace("Dr. ","dr")

# # Regex patterns need regex=True on recent pandas versions
# df['transcript'] = df['transcript'].str.replace(r'\d+','', regex=True)
# df['transcript'] = df['transcript'].str.replace(r'<.*?>','', regex=True)
# for i in string.punctuation:
#     # Punctuation characters are matched literally, not as regex
#     if i == "'":
#         df['transcript'] = df['transcript'].str.replace(i,'', regex=False)
#     else:
#         df['transcript'] = df['transcript'].str.replace(i,' ', regex=False)
# df['transcript'] = df['transcript'].map(lambda com : stopWordsRemove(com))
# df['transcript'] = df['transcript'].map(lambda com : lemitizeWords(com))
# df['transcript'] = df['transcript'].str.replace(r'\s+',' ', regex=True)


In [None]:
# df['transcript'][0]

## 3.1 Compare Tags to get a single tag

In [7]:
def count_tags(tag_column):
    tags = tag_column.str.replace(', ', ',').str.lower().str.strip()
    joined_tags = tags.str.cat(sep=',').split(',')
    all_tags_w_dup = pd.Series(joined_tags)

    tag_counts = all_tags_w_dup.value_counts()
    tag_list = list(tag_counts.index)
    return tag_counts, tag_list

In [8]:
tag_counts, tag_list = count_tags(df['squash15_tags'])
tag_counts

science          1467
culture          1155
technology        787
global issues     679
design            477
history           385
business          349
entertainment     285
media             279
biomechanics      220
future            218
biodiversity      218
humanity          217
politics          199
communication     185
dtype: int64

In [9]:
def tag_selection(df=df):
    """Keep one tag per talk: the tag that is most frequent across the whole corpus."""
    complete_transcripts_tags = []
    for _, row in df.iterrows():
        indiv_tags = row['squash15_tags'].split(',')
        # Pick the tag with the highest corpus-wide count from tag_counts
        # (ties go to the first listed tag).
        best_tag = max(indiv_tags, key=lambda tag: tag_counts[tag])
        complete_transcripts_tags.append([row['clean_transcript_string'], best_tag])
    return pd.DataFrame(complete_transcripts_tags, columns=['clean_transcript_string', 'squash15_tags'])

tag_cleaned = tag_selection()
tag_cleaned

Unnamed: 0,clean_transcript_string,squash15_tags
0,thank chris truly great honor opportunity come...,science
1,term invention like tell tale favorite project...,global issues
2,public dewey long ago observe constitute discu...,science
3,want start say houston problem enter second ge...,science
4,want talk background idea car art actually mea...,science
5,break ask people comment age debate comment un...,science
6,music sound silence simon garfunkel hello voic...,science
7,kurt andersen like architect david hog limelig...,culture
8,point time come learn morning world expert gue...,science
9,legitimate concern aid avian flu hear brillian...,science
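
The selection rule in `tag_selection` keeps, for each talk, whichever of its tags is most common across the whole corpus. A minimal sketch of that rule on toy data follows; the tag strings are made up for illustration, and `pick_most_common_tag` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Toy stand-in for the 'squash15_tags' column (made-up values).
toy_tags = [
    "culture,politics,science",
    "design,global issues",
    "science,design",
    "culture,science",
]

# Corpus-wide tag frequencies, analogous to count_tags().
toy_tag_counts = Counter(tag for row in toy_tags for tag in row.split(","))


def pick_most_common_tag(row, counts):
    """Return the tag in `row` with the highest corpus-wide count."""
    # max() returns the first maximal item, so ties go to the first listed tag.
    return max(row.split(","), key=lambda tag: counts[tag])


selected = [pick_most_common_tag(row, toy_tag_counts) for row in toy_tags]
print(selected)
```

Because the most frequent tag always wins, the resulting labels skew toward the dominant class, which is worth keeping in mind when reading accuracy figures later on.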


In [None]:
# Check unique tags (tag_cleaned stores the selected tag in 'squash15_tags')
tags_cleaned_up = tag_cleaned['squash15_tags'].unique()
print(len(tags_cleaned_up))

# 4. ML part

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# do the train test split 
transcript = tag_cleaned['clean_transcript_string'].to_numpy()
tags = tag_cleaned['squash15_tags'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
     transcript, tags, test_size=0.2, random_state=42)

## 4.1 Naive Bayes Model

In [None]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


In [None]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)
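
To score unseen text with the manually fitted objects above, the new documents have to go through `transform` (not `fit_transform`) so that they share the training vocabulary and IDF weights. The following is a minimal sketch on a made-up toy corpus, not the TED data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus standing in for the cleaned transcripts.
train_docs = ["planet energy climate", "climate science data",
              "art design car", "design chair art"]
train_labels = ["science", "science", "design", "design"]
test_docs = ["climate energy", "car design"]

# Fit the vocabulary and IDF weights on the training documents only.
toy_count_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_train_tfidf = toy_tfidf.fit_transform(toy_count_vect.fit_transform(train_docs))

toy_clf = MultinomialNB().fit(toy_train_tfidf, train_labels)

# Reuse the fitted objects: transform() keeps the training vocabulary.
toy_test_tfidf = toy_tfidf.transform(toy_count_vect.transform(test_docs))
toy_predicted = toy_clf.predict(toy_test_tfidf)
print(toy_train_tfidf.shape, toy_test_tfidf.shape, toy_predicted)
```

Wrapping the same steps in a `Pipeline` handles this bookkeeping automatically.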


In [None]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, y_train)

In [None]:
predicted = text_clf.predict(X_test)
# print(y_test)
# print(predicted)
np.mean(predicted == y_test)
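
Since the single-tag labels are likely dominated by one class, a raw accuracy figure is easier to interpret next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the numbers are illustrative only and say nothing about the TED results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with one dominant class, mimicking an imbalanced tag column.
toy_y_train = np.array(["science"] * 6 + ["culture"] * 3 + ["design"] * 1)
toy_y_test = np.array(["science", "science", "culture", "science", "design"])

# DummyClassifier ignores the features, so placeholders are enough.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X_train, toy_y_train)
baseline_accuracy = np.mean(baseline.predict(toy_X_test) == toy_y_test)
print(baseline_accuracy)
```

A model is only adding value if its accuracy clearly exceeds this baseline.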

In [None]:
sample_ls = ['sample text']
sample_ls = np.array(sample_ls)
predicted_new = text_clf.predict(sample_ls)
print(predicted_new)

## 4.2 SVM Model

In [None]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), 
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, n_iter_no_change=5, random_state=42)),
                        ])
text_clf_svm = text_clf_svm.fit(X_train, y_train)
# Predict on the held-out set so the comparison with y_test is valid
predicted_svm = text_clf_svm.predict(X_test)
np.mean(predicted_svm == y_test)

## 4.3 Gridsearch

In [None]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                      'tfidf__use_idf': (True, False), 
                      'clf-svm__alpha': (1e-2, 1e-3),}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train, y_train)
gs_clf_svm.best_score_
gs_clf_svm.best_params_

In [None]:
gs_clf_svm.best_score_
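
`best_score_` and `best_params_` report only the winning combination. As a hedged sketch, the full cross-validation table can be inspected through `cv_results_`; the corpus below is made up, and `cv=2` is chosen only because it is so small.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus; far too small for meaningful scores.
toy_docs = ["planet energy climate", "climate science data", "energy data planet",
            "science climate energy", "art design car", "design chair art",
            "car chair design", "art car chair"]
toy_labels = ["science"] * 4 + ["design"] * 4

toy_pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf-svm", SGDClassifier(loss="hinge", random_state=42))])
toy_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
            "tfidf__use_idf": (True, False),
            "clf-svm__alpha": (1e-2, 1e-3)}

# cv=2 only because the toy corpus is tiny.
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2).fit(toy_docs, toy_labels)

# One row per parameter combination, best-ranked first.
toy_results = (pd.DataFrame(toy_search.cv_results_)
               [["params", "mean_test_score", "rank_test_score"]]
               .sort_values("rank_test_score"))
print(toy_results.to_string(index=False))
```

Looking at all rows shows how close the runner-up settings are, which helps judge whether the best parameters are meaningfully better or within noise.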

In [None]:
from joblib import dump, load
#dump(gs_clf_svm, 'gs_clf_svm.joblib') 
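
The commented-out `dump` call would persist the fitted grid search. A minimal round-trip sketch with `joblib` follows, using a made-up toy pipeline in place of `gs_clf_svm` and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy model standing in for gs_clf_svm.
toy_docs = ["planet energy climate", "climate science data",
            "art design car", "design chair art"]
toy_labels = ["science", "science", "design", "design"]
toy_model = Pipeline([("vect", CountVectorizer()),
                      ("clf", MultinomialNB())]).fit(toy_docs, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, model_path)   # serialize the fitted pipeline to disk
    restored = load(model_path)   # read it back into memory

# The restored pipeline should reproduce the original predictions.
same_predictions = list(toy_model.predict(toy_docs)) == list(restored.predict(toy_docs))
print(same_predictions)
```

Loading a joblib file executes pickled code, so only load files from a trusted source, ideally with the same scikit-learn version that wrote them.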

In [None]:
# Raw (uncleaned) transcript excerpt; the string must stay on one line to be a valid literal
sample_ls = ["As New Yorkers, we're often busy looking up at the development going on around us. We rarely stop to consider what lies beneath the city streets. And it's really hard to imagine that this small island village would one day become a forest of skyscrapers. Yet, as an urban archaeologist, that's exactly what I do. I consider landscapes, artifacts to tell the stories of the people who walked these streets before us. Because history is so much more than facts and figures. When people think of archaeology, they usually think of dusty old maps, far off lands, ancient civilizations. You don't think New York City and construction sites. Yet, that's where all the action happens and we're never sure exactly what we're going to find beneath the city streets. Like this wooden well ring which was the base for the construction of a water well. It provided us an opportunity to take a sample of the wood for tree-ring dating, and get a date to confirm the fact that we had indeed found a series of 18th-century structures beneath Fulton Street. Archaeology is about everyday people using everyday objects, like the child who may have played with this small toy, or the person who consumed the contents of this bottle. This bottle contained water imported from Germany and dates to 1790. Now okay, we know New Yorkers always had to go to great lengths to get fresh drinking water. Small island, you really couldn't drink the well water, it was too brackish. But the notion that New Yorkers were importing bottled water from Europe, more than two hundred years ago, is truly a testament to the fact that New York City is a cosmopolitan city, always has been, where you could get practically anything from anywhere. If you and I were to walk through City Hall Park, you might see an urban park and government offices. I see New York City's largest and most complex archaeological site. And it's significant not because it's City Hall, but because of the thousands of poor prisoners and British soldiers who lived and died here. Before it was City Hall Park, the area was known as The Common, and it was pretty far outside the city limits. In the 17th century, it was a place for public protests and execution. "]
sample_ls = np.array(sample_ls)
predicted_new = gs_clf_svm.predict(sample_ls)
print(predicted_new)