In [1]:
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
df = pd.read_csv('data/train.csv')

# Brief look at the data

In [3]:
df.sample(20)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 3182 | 4568 | emergency%20plan | Calgary,AB, Canada | The City has activated the Municipal Emergency... | 1 |
| 5688 | 8118 | rescued | Philadelphia, Pennsylvania USA | Homeless Man Traveling Across USA With 11 Stra... | 0 |
| 5531 | 7889 | quarantined | china | Top link: Reddit's new content policy goes int... | 0 |
| 175 | 251 | ambulance | New Orleans, LA | Leading emergency services boss welcomes new a... | 0 |
| 2358 | 3394 | demolition | US-PR | @Treyarch @DavidVonderhaar bring back demolit... | 0 |
| 2890 | 4153 | drown | | 'Save me from my self don't let me drown'. | 0 |
| 1068 | 1543 | bomb | Canada | @CranBoonitz So going to make any bomb threats... | 0 |
| 2088 | 3001 | dead | Milton Keynes | Can't believe Ross is dead???????? @emmerdale ... | 0 |
| 7419 | 10612 | wounded | | Police Officer Wounded Suspect Dead After Exch... | 1 |
| 5361 | 7650 | panic | | The cool kids asked me if I wanted to hang out... | 0 |


# Clean the text up a little bit

In [4]:
df['text'] = df['text'].str.lower()  # convert to lowercase
df['text'] = df['text'].str.replace("#", "", regex=False)  # remove hash symbols (keeps the hashtag word)

In [5]:
# Each step splits the tweet on single spaces and drops unwanted tokens.
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split(" ") if "http" not in w]))  # remove hyperlinks
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split(" ") if "@" not in w]))  # remove @mentions
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split(" ") if w != "a"]))  # remove "a"
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split(" ") if w != "the"]))  # remove "the"
df['text'] = df['text'].apply(lambda x: " ".join([w for w in x.split(" ") if w != "an"]))  # remove "an"
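
The cleaning steps above can also be folded into one function that makes a single pass over the tokens. This is only a sketch of an equivalent approach, not the code the notebook ran; `clean_tweet`, `STOPWORDS`, and the example string are illustrative names and data introduced here.

```python
# Hypothetical helper: one-pass version of the cleaning steps in this notebook.
STOPWORDS = {"a", "an", "the"}  # the three articles removed above


def clean_tweet(text: str) -> str:
    """Lowercase, strip '#', and drop links, @mentions, and articles."""
    text = text.lower().replace("#", "")
    kept = [
        tok
        for tok in text.split(" ")
        if "http" not in tok and "@" not in tok and tok not in STOPWORDS
    ]
    return " ".join(kept)


# Made-up example tweet, not taken from the dataset.
example = "Our Deeds are the Reason of this #earthquake http://t.co/x @user"
print(clean_tweet(example))  # our deeds are reason of this earthquake
```

With a function like this, the whole cleaning step would reduce to `df['text'] = df['text'].apply(clean_tweet)`.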

In [6]:
df.sample(25)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 5744 | 8202 | riot | | discovered by \n listen/buy riot on û_ blowma... | 0 |
| 35 | 53 | ablaze | London, UK | on plus side look at sky last night it was ablaze | 0 |
| 6675 | 9566 | thunder | Decatur, GA | brings her alabama thunder back to attic septe... | 0 |
| 211 | 298 | annihilated | New York, NY | uribe just annihilated that baseball. mets | 0 |
| 2411 | 3470 | derailed | USA | note there were no passengers on board when tr... | 1 |
| 6975 | 10005 | tsunami | | all of this energy | 0 |
| 5734 | 8183 | rescuers | | video: 'we're picking up bodies from water': r... | 1 |
| 6072 | 8674 | sinkhole | San Diego California 92101 | water main break disrupts trolley service sand... | 1 |
| 4173 | 5928 | harm | Portland, OR | no harm no foul and somebody needed to say it. | 0 |
| 5696 | 8129 | rescued | | heroes! springer spaniel &amp; her dog dad res... | 1 |


# We must turn text into a mathematical representation
There are a number of ways to do this. The simplest is a count vectorizer, where we simply count the number of times each word shows up in the tweet.

In [7]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first tweet in the data
example_train_vectors = count_vectorizer.fit_transform(df["text"][0:1])

## Notice there are 12 unique words in the tweet (after cleaning)

In [9]:
df.iloc[0]["text"]

'our deeds are reason of this earthquake may allah forgive us all'

In [10]:
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 12)
[[1 1 1 1 1 1 1 1 1 1 1 1]]


### Let's see what happens when we scale this up a bit 
We are now going to vectorize 20 tweets at once. This increases our __vocabulary__, which is an important concept in NLP.

In [11]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 20 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(df["text"][0:20])

In [12]:
df.iloc[4]["text"]

'just got sent this photo from ruby alaska as smoke from wildfires pours into school '

In [13]:
print(example_train_vectors[4].todense().shape)
print(example_train_vectors[4].todense())

(1, 145)
[[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
  0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
  0]]


### Note there are 145 words in the "vocabulary" of the embedding above.
There are infinitely many possible "embeddings". This particular one, a bag-of-words count, ignores the order of the words.
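
A small illustration of that last point: under a count-based embedding, two sentences made of the same words in a different order get the same vector. The sketch below uses plain Python rather than `CountVectorizer`, and the helper `bag_of_words` and both sentences are made up for this example.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


first = "dog bites man"
second = "man bites dog"

# Shared vocabulary built from both sentences, in alphabetical order.
vocabulary = sorted(set(first.split()) | set(second.split()))

vec_first = bag_of_words(first, vocabulary)
vec_second = bag_of_words(second, vocabulary)

print(vocabulary)               # ['bites', 'dog', 'man']
print(vec_first, vec_second)    # [1, 1, 1] [1, 1, 1]
print(vec_first == vec_second)  # True
```

The two sentences mean very different things, yet the model sees identical inputs.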

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(pd.DataFrame(df['text']), pd.DataFrame(df['target']), test_size=0.33, random_state=42)


In [21]:
X_train.sample(3)

| | text |
|---|---|
| 3486 | pyrotechnic artwork by cai guo-qiang explodes ... |
| 3988 | don't worry i'm sure climate has nothing to do... |
| 6454 | imagine school where suicide bombing is being ... |


## Now that we have seen embeddings, let's embed the entire dataset.
This will create the "features" that our machine learning models learn from. Machine learning models are ultimately statistical equations, so all inputs must be numbers. One thing to call out: the training dataset defines the vocabulary, and therefore its size. How might this affect things in unexpected ways?

In [23]:
train_vectors = count_vectorizer.fit_transform(X_train["text"])

# Note that we're NOT using .fit_transform() here. Using just .transform() ensures
# that only the tokens learned from the training set are mapped into the test vectors,
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(X_test["text"])

In [27]:
print(train_vectors[0].todense().shape)
print(train_vectors[0].todense())
print(f"Vocabulary is {train_vectors[0].todense().shape[1]} words!")

(1, 11751)
[[0 0 0 ... 0 0 0]]
Vocabulary is 11751 words!


# Now that we have our features, let's train the actual model
For this we will use a very simple logistic regression model, which is a standard baseline for simple binary classification.

In [30]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# Pass the target as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning.
model.fit(train_vectors, y_train['target'])


# Evaluate the performance on test set
In a production setting, much more goes into evaluating a model's efficacy (mostly determining whether the test set is a good representation of reality), but for our pedagogical purposes this will do.

In [31]:
accuracy = model.score(test_vectors, y_test['target'])
print("Accuracy:", accuracy)

Accuracy: 0.8038201352964585
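
Accuracy is a single number, and one common next step is to split it into precision and recall for the positive class. The sketch below computes all three by hand on made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical and are not the notebook's predictions.

```python
# Hypothetical labels: 1 = real disaster, 0 = not a disaster.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of predicted disasters, how many were real
recall = tp / (tp + fn)     # of real disasters, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.625 precision=0.667 recall=0.500
```

For the real model, the same quantities could be obtained by comparing `model.predict(test_vectors)` against `y_test['target']`.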


# We get about 80 percent accuracy
__Notice that this is better than the tutorial notebook we started from__: https://www.kaggle.com/code/philculliton/nlp-getting-started-tutorial . Try to work out what I did differently to get a better score than the tutorial.

## Importance of embeddings

Notice that numerical embeddings let us take __text__ data and turn it into a mathematical vector, on which we can perform statistical analysis via logistic regression. The concept of embeddings plays a crucial role in modern AI. Computers can process only mathematical expressions, so finding creative ways to embed non-numerical information can become an art.

Beyond training decision algorithms on the embedded vectors, interesting transformations can also take place in the embedded space. For example, if I give a human or an AI algorithm the phrase "A Red Apple with a Worm inside of it", both can embed the text into something abstract, then extract a visual representation of the same information.

## Check out other methods at
https://www.kaggle.com/competitions/nlp-getting-started/code?competitionId=17777 
and bring cool ideas to the next CoP meeting. 