# Lecture Notes

<h2>Text Classification</h2>

**Humans decide what categories to use for text classification, then we train AI to do the labelling.**

The first kind of model that we’ll talk about is **text classification**. This fills one of the core slots in a pipeline, between data collection and visualization. We start by deciding what categories are important to us. This means that we create a complete classification system, in which each text belongs to one or another category. If I want to sort tweets by language, I first come up with examples of all the languages I am interested in. Some categories might be quite large (a majority class, like English) while others are quite small (a minority class, like Samoan).

Then we train a classifier to automate the labelling task. Labelling here means assigning each text to the correct category. If a tweet is written in Samoan, I want the classifier to label it as Samoan. The goal, of course, is to automate labelling so that we can know the categories of millions or billions of texts. Training here means that we show the classifier examples with their correct labels until the machine is able to make accurate predictions on its own.
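As a rough illustration of training and then labelling unseen texts, here is a minimal sketch using *scikit-learn*. Everything in it is an assumption made for illustration: the made-up English and Spanish sentences stand in for real tweets, and character n-grams with logistic regression are just one simple choice, not necessarily the model used later in these labs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up training examples: each text is paired with its correct label.
train_texts = [
    "the weather is lovely today",
    "i am going to the shop this afternoon",
    "what time does the game start",
    "el tiempo es muy bueno hoy",
    "voy a la tienda esta tarde",
    "a que hora empieza el partido",
]
train_labels = ["english", "english", "english", "spanish", "spanish", "spanish"]

# Held-out examples that the classifier does not see during training.
test_texts = ["the shop is closed today", "la tienda esta cerrada hoy"]
test_labels = ["english", "spanish"]

# Character n-grams are one simple way to represent text for language identification.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# Training: show the classifier the examples together with their labels.
model.fit(train_texts, train_labels)

# Labelling: ask the trained classifier to assign a category to new texts.
predictions = model.predict(test_texts)

# Evaluation: the share of unseen examples that were labelled correctly.
accuracy = model.score(test_texts, test_labels)
print(list(predictions))
print(f"Accuracy on unseen examples: {accuracy:.2f}")
```

With only a handful of examples the accuracy figure means very little; the point is the shape of the workflow: fit on labelled examples, then predict and score on examples held back from training.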

Let’s break down the problem of text classification.



1.  We need to design a category system. If I want to sort hotels by the kind of guests they cater to, I need to make up a set of hotel types: luxury vs. business vs. family vs. budget.

2.  We need to represent language in a way that focuses on a particular kind of meaning. If I’m categorizing hotels, I want to focus on the topics that reviewers talk about: service, sleep quality, cleanliness, amenities. I need to make sure that the classifier captures these domains.

3. We need to train and evaluate a classification model. This means I take hotel reviews from each category, show them to the classifier, and help it learn to distinguish between types of hotels. Then I test the classifier on examples it hasn’t seen before. How many does it get right?

4. We’ll talk about how to annotate new datasets while ensuring that our labels are consistent. If we can’t agree, as humans, about what kinds of hotels there are, then we can’t teach a machine to do it for us.
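One common way to check whether two human annotators agree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a hand-rolled illustration under stated assumptions: the two label lists are hypothetical, and kappa is only one of several agreement measures that could be used here.

```python
from collections import Counter

# Hypothetical labels from two annotators for the same ten hotels.
annotator_a = ["luxury", "budget", "family", "business", "budget",
               "luxury", "family", "budget", "business", "family"]
annotator_b = ["luxury", "budget", "family", "business", "family",
               "luxury", "family", "budget", "luxury", "family"]


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items that both annotators labelled the same way.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both pick the same label independently,
    # given how often each annotator uses each label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance. If the score is low, the category definitions probably need to be revised before any classifier is trained on the labels.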

* Kinds of Categories. We can design categories based on a document’s **content, sentiment, or author**. Content is the meaning of the text, what it talks about: sports or politics or a new television show. Sentiment is the tone or emotional content of the text. For example, we could have an angry article about a rugby match and a joyous one; sometimes we need to know which is which. The author is the person who wrote the text: a middle-aged American woman or a young Japanese man, for example. All of this information (content, sentiment, and author) is present in language, but each kind is carried by different parts of the text. We’ll explore how to represent language so that we focus on one or the other.

* Supervised Learning. At its core, this is a supervised problem: we as humans define it by deciding in advance what the categories will be. Some category systems are scientifically valid: for example, we know that we can identify the dialect or native language of a document’s author. But other category systems are not valid: for example, we can’t know what social clubs the author belongs to or what their favorite food is. We must establish a good justification for the categories we propose.

# Colab Setup

In [None]:
# If you are running these labs in Colab, you will first need to mount the drive and
# copy text_analytics.py into the working directory so that it can be imported

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Copy text_analytics.py into the working directory so that it can be imported
!cp "/content/drive/My Drive/Colab Notebooks/CourseWork/Text Analytics and Natural Language Processing/text_analytics.py" .
print("Done!")

Done!


# Lecture Lab

Today we're going to look at x and y arrays, something that we'll need for training our text classifiers later.

Basically, our dataset has two essential pieces of information: first, the texts that we're working with; second, the category that each text belongs to. We call the texts the *x array*, and we call the list of labels, one per text, the *y array*. In *scikit-learn*, for example, the convention is to talk about X and y. This just means the language and the labels!

In [None]:
from text_analytics import text_analytics
import os
import pandas as pd

ai = text_analytics()
print("Done!")

Done!


This time we're going to work with tweets that represent different cities around the world. So we first load this data into memory.

In [None]:
file = os.path.join(ai.data_dir, "Twitter_by_City.gz")
df = pd.read_csv(file, index_col = 0)
print(df)
print("Done!")

            City                                               Text
0       auckland   i m sorry who didn t vote beige show me these...
1       auckland   meet our analysts idc australia new zealand a...
2       auckland   aue my great grandfather was an aboriginal te...
3       auckland   yeah dumping corsica would be a good idea tot...
4       auckland   thank you for your partnership to help get de...
...          ...                                                ...
373123   toronto   will hopefully be great and all but this is j...
373124   toronto   leafs defenceman jake muzzin out for remainde...
373125   toronto   read into everything going on in lebanon not ...
373126   toronto   remy i hate this debate he s a street rat he ...
373127   toronto   brutal other day i lost the bucks first quart...

[373128 rows x 2 columns]
Done!


Ok, so now we have our data. Instead of a full table or dataframe like this, we want two separate arrays.

In [None]:
x = df.loc[:,"Text"]
y = df.loc[:,"City"]
print("Done!")

Done!


So, later, when we use *scikit-learn* or *tensorflow* to build models, we can use this syntax to get our respective x and y arrays. Let's see what they look like.

In [None]:
print(x)
print(y)
print("Done!")

0          i m sorry who didn t vote beige show me these...
1          meet our analysts idc australia new zealand a...
2          aue my great grandfather was an aboriginal te...
3          yeah dumping corsica would be a good idea tot...
4          thank you for your partnership to help get de...
                                ...                        
373123     will hopefully be great and all but this is j...
373124     leafs defenceman jake muzzin out for remainde...
373125     read into everything going on in lebanon not ...
373126     remy i hate this debate he s a street rat he ...
373127     brutal other day i lost the bucks first quart...
Name: Text, Length: 373128, dtype: object
0         auckland
1         auckland
2         auckland
3         auckland
4         auckland
            ...   
373123     toronto
373124     toronto
373125     toronto
373126     toronto
373127     toronto
Name: City, Length: 373128, dtype: object
Done!


You'll notice that each array is the same length: 373,128 items. Each item here is actually an aggregation of about 25 tweets. That gives us more data per sample, which is a lot easier to work with.
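Because each label has to line up with exactly one text, it can be worth checking that alignment explicitly, and also looking at how many samples each label has. The sketch below uses a small made-up table (`toy_df`, `toy_x`, and `toy_y` are hypothetical names) so that it runs without the course data; the same calls should work on the real `x` and `y`.

```python
import pandas as pd

# Toy stand-in for the real table: same two columns, made-up rows.
toy_df = pd.DataFrame(
    {
        "City": ["auckland", "auckland", "auckland", "toronto"],
        "Text": ["first sample", "second sample", "third sample", "fourth sample"],
    }
)

toy_x = toy_df.loc[:, "Text"]
toy_y = toy_df.loc[:, "City"]

# The arrays must line up one-to-one: same length and same index.
same_length = len(toy_x) == len(toy_y)
same_index = toy_x.index.equals(toy_y.index)
print(same_length, same_index)

# Number of samples per label, largest class first.
label_counts = toy_y.value_counts()
print(label_counts)
```

The counts show at a glance which labels are majority classes and which are minority classes, which matters later when we evaluate a classifier.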

And that's all for today! Here we've learned how to make separate arrays that represent our data and our meta-data.

Try using the code block below to create x and y arrays for a new data set, this time using "Author" as the label:

    "Gutenberg.1850.Authors.gz"

# Practice

In [None]:
file = os.path.join(ai.data_dir, "Gutenberg.1850.Authors.gz")
df = pd.read_csv(file, index_col = 0)

# Pandas formatting:
# increase the number of columns displayed and the display width for easier reading
pd.set_option("display.max_columns", 10)
pd.set_option('display.width', 1000)

print(df)
print("Done!")

        index  Generation    Author                Title                                               Text
0         150        1850  abbott_j  alexander_the_great  note project gutenberg also has an html versio...
1         151        1850  abbott_j  alexander_the_great  it will be recollected to epirus where her fri...
2         152        1850  abbott_j  alexander_the_great  it would be best to endeavor to effect a landi...
3         153        1850  abbott_j  alexander_the_great  transport his army across the straits the army...
4         154        1850  abbott_j  alexander_the_great  that the true greatness of the soul of alexand...
...       ...         ...       ...                  ...                                                ...
17042  125483        1850    wood_h       victor_serenus  since i have been with amabel it hath waxed st...
17043  125484        1850    wood_h       victor_serenus  his face uttered a loud cry and shrank back af...
17044  125485        1850   

In [None]:
x = df.loc[:,"Text"]
y = df.loc[:,"Author"]

print("Text, X")
print("")
print(x)
print("")
print("Labels, Y")
print("")
print(y)

Text, X

0        note project gutenberg also has an html versio...
1        it will be recollected to epirus where her fri...
2        it would be best to endeavor to effect a landi...
3        transport his army across the straits the army...
4        that the true greatness of the soul of alexand...
                               ...                        
17042    since i have been with amabel it hath waxed st...
17043    his face uttered a loud cry and shrank back af...
17044    the taurus mountains made the afternoon balmy ...
17045    me to a place not very far distant where all m...
17046    never knew these things before now thou dost r...
Name: Text, Length: 15999, dtype: object

Labels, Y

0        abbott_j
1        abbott_j
2        abbott_j
3        abbott_j
4        abbott_j
           ...   
17042      wood_h
17043      wood_h
17044      wood_h
17045      wood_h
17046      wood_h
Name: Author, Length: 15999, dtype: object


In [None]:
x.loc[[1]]

1    it will be recollected to epirus where her fri...
Name: Text, dtype: object
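
Note that the output above reports `Length: 15999` while the index labels run up to 17046, which suggests that the labels are not consecutive positions. In pandas, `.loc` selects by index label and `.iloc` selects by position, so the two can return different rows when labels have gaps. A minimal sketch with a made-up Series (`toy_x` is a hypothetical name):

```python
import pandas as pd

# Toy Series whose index labels have a gap, loosely mimicking the output above.
toy_x = pd.Series(["text a", "text b", "text c"], index=[0, 1, 5], name="Text")

by_label = toy_x.loc[[5]]      # the row whose index label is 5
by_position = toy_x.iloc[[2]]  # the third row, whatever its label is
single_value = toy_x.loc[5]    # a plain string rather than a one-row Series

print(by_label)
print(by_position)
print(single_value)
```

Passing a list, as in `x.loc[[1]]`, returns a one-row Series that keeps its label; passing a bare label returns just the value.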