# 1. References

Title: Title Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.

Link: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a


# 2. Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import nltk
# nltk.download()
from nltk.stem.snowball import SnowballStemmer

# 3. Data cleaning

In [2]:
DATA_DIR = "../../data/raw/"
INPUT_FILE_NAME = 'subset_raw.parquet'

In [3]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0_level_0,speaker,headline,description,duration,tags,transcript,WC
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0
2,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0
3,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0
4,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0
5,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0


In [4]:
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2475 entries, 1 to 2804
Data columns (total 7 columns):
speaker        2475 non-null object
headline       2475 non-null object
description    2475 non-null object
duration       2475 non-null object
tags           2475 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 154.7+ KB


In [5]:
df = df.dropna(subset=['transcript'])
df = df.reset_index(drop=True)
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2386 entries, 0 to 2385
Data columns (total 7 columns):
speaker        2386 non-null object
headline       2386 non-null object
description    2386 non-null object
duration       2386 non-null object
tags           2386 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 130.6+ KB


In [6]:
df['transcript'][0]

'0:14\r\r\rThank you so much, Chris.\rAnd it\'s truly a great honor\rto have the opportunity\rto come to this stage twice;\rI\'m extremely grateful.\rI have been blown away by this conference,\rand I want to thank all of you\rfor the many nice comments\rabout what I had to say the other night.\rAnd I say that sincerely,\rpartly because (Mock sob)\rI need that.\r\r\r\r\r 0:40\r\r\r(Laughter)\r\r\r\r\r 0:45\r\r\rPut yourselves in my position.\r\r\r\r\r 0:47\r\r\r(Laughter)\r\r\r\r\r 0:54\r\r\rI flew on Air Force Two for eight years.\r\r\r\r\r 0:57\r\r\r(Laughter)\r\r\r\r\r 0:59\r\r\rNow I have to take off my shoes\ror boots to get on an airplane!\r\r\r\r\r 1:02\r\r\r(Laughter)\r\r\r\r\r 1:05\r\r\r(Applause)\r\r\r\r\r 1:11\r\r\rI\'ll tell you one quick story\rto illustrate what\rthat\'s been like for me.\r\r\r\r\r 1:16\r\r\r(Laughter)\r\r\r\r\r 1:18\r\r\rIt\'s a true story \revery bit of this is true.\r\r\r\r\r 1:21\r\r\rSoon after Tipper and I left the \r(Mock sob) White House \r\r\r\r\r

In [7]:
'''
1. Numbers
2. Apostrophe
3. All punctuations
4. Weird symbols
5. Stop words
6. lemmatization
'''

import string
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from sklearn.feature_extraction import stop_words
from nltk.stem.wordnet import WordNetLemmatizer
sets=[stop_words.ENGLISH_STOP_WORDS]
sklearnStopWords = [list(x) for x in sets][0]
token=ToktokTokenizer()
lemma=WordNetLemmatizer()
stopWordList=stopwords.words('english')
stopWords = stopWordList + sklearnStopWords
stopWords = list(dict.fromkeys(stopWords))


def stopWordsRemove(text):
    wordList=[x.lower().strip() for x in token.tokenize(text)]
    removedList=[x + ' ' for x in wordList if not x in stopWords]
    text=''.join(removedList)
    return text


def lemitizeWords(text):
    words=token.tokenize(text)
    listLemma=[]
    for w in words:
        x=lemma.lemmatize(w,'v')
        listLemma.append(x)
    return text


# There is a mispelt word that needs to be replaced
df['transcript'] = df['transcript'].str.replace('childrn','children')

df['transcript'] = df['transcript'].str.replace('\r',' ')
df['transcript'] = df['transcript'].str.replace("\'s"," is")
df['transcript'] = df['transcript'].str.replace("\'m"," am")
df['transcript'] = df['transcript'].str.replace("\'ll"," will")
df['transcript'] = df['transcript'].str.replace("Can\'t","cannot")
df['transcript'] = df['transcript'].str.replace("Sha\'t","shall not")
df['transcript'] = df['transcript'].str.replace("Won\'t","would not")
df['transcript'] = df['transcript'].str.replace("n\'t"," not")
df['transcript'] = df['transcript'].str.replace("\'ve"," have")
df['transcript'] = df['transcript'].str.replace("\'re"," are")
df['transcript'] = df['transcript'].str.replace("\'d"," would")
df['transcript'] = df['transcript'].str.replace(r"\(([^)]+)\)","")
# Deal with Mr. and Dr.
df['transcript'] = df['transcript'].str.replace("mr. ","mr")
df['transcript'] = df['transcript'].str.replace("Mr. ","mr")
df['transcript'] = df['transcript'].str.replace("dr. ","dr")
df['transcript'] = df['transcript'].str.replace("mrs. ","mrs")
df['transcript'] = df['transcript'].str.replace("Mrs. ","mrs")
df['transcript'] = df['transcript'].str.replace("Dr. ","dr")

df['transcript'] = df['transcript'].str.replace(r'\d+','')
df['transcript'] = df['transcript'].str.replace(r'<.*?>','')
for i in string.punctuation:
    if i == "'":
        df['transcript'] = df['transcript'].str.replace(i,'')
    else:
        df['transcript'] = df['transcript'].str.replace(i,' ')
df['transcript'] = df['transcript'].map(lambda com : stopWordsRemove(com))
df['transcript'] = df['transcript'].map(lambda com : lemitizeWords(com))
df['transcript'] = df['transcript'].str.replace('\s+',' ')


In [8]:
df['transcript'][0]

'thank chris truly great honor opportunity come stage twice extremely grateful blown away conference want thank nice comments say night say sincerely partly need position flew air force years shoes boots airplane tell quick story illustrate like true story bit true soon tipper left white house driving home nashville little farm miles east nashville driving know sounds like little thing looked rear view mirror sudden hit motorcade heard phantom limb pain rented ford taurus dinnertime started looking place eat got exit lebanon tennessee got exit shoney restaurant low cost family restaurant chain know went sat booth waitress came big commotion tipper took order went couple booth lowered voice really strain hear saying said yes vice president al gore wife tipper man said come long way kind series epiphanies day continuing totally true story got g v fly africa make speech nigeria city lagos topic energy began speech telling story happened day nashville told pretty way shared tipper driving 

## 3.1 Compare Tags to get a single tag

In [9]:
def tag_selection(df=df):
    complete_transcripts_tags = []
    for rows, value in df.iterrows():
        indiv_tags = value['tags'].split(',')
        longest_tag = max(indiv_tags, key=len)
        indiv_transcript_tags = [value['transcript'], longest_tag]
        complete_transcripts_tags.append(indiv_transcript_tags)
    return pd.DataFrame(complete_transcripts_tags, columns=['transcript', 'tags'])
tag_cleaned = tag_selection()
tag_cleaned

Unnamed: 0,transcript,tags
0,thank chris truly great honor opportunity come...,alternative energy
1,terms invention like tell tale favorite projec...,alternative energy
2,public dewey long ago observed constituted dis...,global development
3,want start saying houston problem entering sec...,industrial design
4,want talk background idea cars art actually qu...,industrial design
...,...,...
2381,imagine walked evening discovered everybody ro...,urban planning
2382,paying close attention easy attention pulled d...,cognitive science
2383,happy pic taken senior college right dance pra...,work-life balance
2384,seven year old grandson sleeps hall wakes lot ...,personal growth


In [10]:
# Check unique tags
tags_cleaned_up = tag_cleaned['tags'].unique()
print(len(tags_cleaned_up))

202


# 4. ML part

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
# do the train test split 
transcript = tag_cleaned['transcript'].to_numpy()
tags = tag_cleaned['tags'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
     transcript, tags, test_size=0.2, random_state=42)

In [13]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(1908, 51326)

In [14]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


(1908, 51326)

In [15]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)


In [16]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, y_train)

In [17]:
predicted = text_clf.predict(X_test)
# print(y_test)
# print(predicted)
np.mean(predicted == y_test)

0.1401673640167364

In [18]:
sample_ls = ['sample text']
sample_ls = np.array(sample_ls)
predicted_new = text_clf.predict(sample_ls)
print(predicted_new)

['global issues']


## 4.2 SVM Model

In [19]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), 
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, n_iter_no_change=5, random_state=42)),
                        ])
text_clf_svm = text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_train)
np.mean(predicted_svm == y_test)

  


0.0

## 4.3 Gridsearch

In [20]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                      'tfidf__use_idf': (True, False), 
                      'clf-svm__alpha': (1e-2, 1e-3),}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train, y_train)
gs_clf_svm.best_score_
gs_clf_svm.best_params_



{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}

In [21]:
gs_clf_svm.best_score_

0.2510482180293501

In [22]:
from joblib import dump, load
#dump(gs_clf_svm, 'gs_clf_svm.joblib') 

['gs_clf_svm.joblib']

In [23]:
sample_ls = ["As New Yorkers, we're often busy looking up at the development going on around us. We rarely stop to consider what lies beneath the city streets. And it's really hard to imagine that this small island village would one day become a forest of skyscrapers. Yet, as an urban archaeologist, that's exactly what I do. I consider landscapes, artifacts to tell the stories of the people who walked these streets before us. Because history is so much more than facts and figures. When people think of archaeology, they usually think of dusty old maps, far off lands, ancient civilizations. You don't think New York City and construction sites. Yet, that's where all the action happens and we're never sure exactly what we're going to find beneath the city streets. Like this wooden well ring which was the base for the construction of a water well. It provided us an opportunity to take a sample of the wood for tree-ring dating, and get a date to confirm the fact that we had indeed found a series of 18th-century structures beneath Fulton Street. Archaeology is about everyday people using everyday objects, like the child who may have played with this small toy, or the person who consumed the contents of this bottle. This bottle contained water imported from Germany and dates to 1790. Now okay, we know New Yorkers always had to go to great lengths to get fresh drinking water. Small island, you really couldn't drink the well water, it was to brackish. But the notion that New Yorkers were importing bottled water from Europe, more then two hundred years ago, is truly a testament to the fact that New York City is a cosmopolitan city, always has been, where you could get practically anything from anywhere. If you and I were to walk through City Hall Park, you might see an urban park and government offices. I see New York City's largest and most complex archaeological site. And it's significant not because it's City Hall, but because of the thousands of poor prisoners and British soldiers who lived and died here. Before it was City Hall Park, the area was known as The Common, and it was pretty far outside the city limits. In the 17th century, it was a place for public protests and execution. "]
sample_ls = np.array(sample_ls)
predicted_new = gs_clf_svm.predict(sample_ls)
print(predicted_new)

['architecture']
