# In the last class we saw how to analyse sentiment using pretrained models. Let's see how to train your own model

* another quick crash course on machine learning using scikit-learn
* your task: classify tweets about American airlines. Based on old (labelled) tweets, the airline wants to know whether a tweet is positive or negative

### Text classification using machine learning

* the power of this method is that it is purely statistical, so it also works on sentences with errors, misspellings, etc.

Now that we’ve looked at some of the cool things spaCy can do in general, let’s look at a bigger real-world application of some of these natural language processing techniques: text classification. Quite often, we may find ourselves with a set of text data that we’d like to classify according to some parameters (perhaps the subject of each snippet, for example), and text classification is what will help us to do this.

The diagram below illustrates the big-picture view of what we want to do when classifying text. First, we extract the features we want from our source text (and any tags or metadata it came with), and then we feed our cleaned data into a machine learning algorithm that does the classification for us.

![](imgs/process.png)

## Procedure
* cleaning data (remove repeated rows etc.; here we assume the dataset is clean)
* understanding data (plots etc.; we skip this part today)
* identifying the columns that will serve as features (the tweet text) and labels (the sentiment)
* make `features` list: list of sentences (`.values` or `.to_numpy()`)
* make `labels` list: list of labels (negative/positive) (`.values` or `.to_numpy()`)
* vectorize the sentences (bag of words or `TfidfVectorizer`)
  https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
* split into X_train, X_test, y_train, y_test
  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

```  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(processed_features_vec, labels, test_size=0.2, random_state=0)
```

* load classifier, train the model and check the accuracy (why a classifier and not a regressor?)
  https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

```  
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(accuracy_score(y_test, predictions))  
```
* We will use the Random Forest algorithm, owing to its ability to act upon non-normalized data.

### a simple cleaning routine that you started during the last class

In [2]:
import re

processed_features = []

for i in range(0, len(features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(features[i]))
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    ### your other cleaning lines

    processed_features.append(processed_feature)

In [321]:
processed_features[0]

' virginamerica what dhepburn said '

In [322]:
len(processed_features)

14640
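
One possible way to fill in the remaining cleaning lines is sketched below. The helper name `clean_tweet` and the particular steps (dropping single-character tokens, collapsing whitespace) are assumptions chosen for illustration, not the reference solution of the course.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: normalise one tweet for vectorization."""
    # Replace every non-word character (punctuation, @, #, ...) with a space
    text = re.sub(r"\W", " ", str(text))
    # Split on whitespace and drop single-character leftovers such as the "s" in "it s"
    tokens = [tok for tok in text.split() if len(tok) > 1]
    # Re-join with single spaces and lowercase
    return " ".join(tokens).lower()

cleaned = clean_tweet("@VirginAmerica What @dhepburn said.")
print(cleaned)
```

Unlike the loop above, this version also strips the leading and trailing spaces, which does not matter for the vectorizer but makes the output easier to read.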

# Machines feed on numbers, so we should convert the words into numbers

* spaCy can do it (every token gets a `.vector`)
* but scikit-learn's vectorizers are better suited to our classification task

In [323]:
# `nlp` is a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
mango = nlp('mango')
print(mango.vector.shape)
print(mango.vector)

(96,)
[-0.70611125 -1.4329455   0.24227941  0.6598132  -0.20285606 -0.3363567
 -1.4245116  -0.11146422 -0.56221646  0.3003068  -0.19000328 -0.08635545
  1.3099948   1.379954    0.02685246  1.5109322  -0.733334    0.80945534
  0.29014212 -0.2684864  -0.7413073  -0.7534003   1.52542    -0.61603916
  0.3729881   0.31268534 -0.68583065 -0.75191927  0.58086497 -1.0955321
  0.86638093 -1.9158285  -0.05129784 -0.20604798  0.2827754  -2.019856
 -0.0126412   0.3666329  -1.2550778   1.6548673  -0.85672385 -0.9216615
  0.2952034   0.01230198 -0.42903078 -0.4966709  -0.25612807 -1.3058071
  1.8100011   0.51152885  0.03403987  0.70565414  0.42585516 -0.8349808
  0.5538808   0.57170147 -1.101404    0.33620203  0.07782254  0.5464119
 -0.06026481 -0.5734616   0.6843033  -1.0217375  -0.11573818 -0.93082213
 -0.85589534  0.5505712   1.3896189  -0.5574837   0.19777809  0.3153283
 -0.37644464  0.38533548  0.02513826 -0.293028   -0.23319107  0.8843169
  0.61514205 -1.189681    1.3120099   0.49911803 -0.060

# Representing Text in Numeric Form

Statistical algorithms use mathematics to train machine learning models. However, mathematics only works with numbers. To make statistical algorithms work with text, we first have to convert the text to numbers. Three main approaches exist: Bag of Words, TF-IDF and Word2Vec. In this section, we will discuss the bag of words and TF-IDF schemes.

### Bag of Words

The bag of words scheme is the simplest way of converting text to numbers.

For instance, suppose you have three documents:

    Doc1 = "I like to play football"
    Doc2 = "It is a good game"
    Doc3 = "I prefer football over rugby"

In the bag of words approach the first step is to create a vocabulary of all the unique words. For the above three documents, our vocabulary will be:

* bag of words (vocabulary):

Vocab = [I, like, to, play, football, it, is, a, good, game, prefer, over, rugby]

* Each document is then converted into a vector with one entry per vocabulary word, counting how often that word occurs in the document. For instance, for Doc1, the feature vector will look like this:

[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
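
The construction above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes lowercasing and whitespace tokenization; the function name `bag_of_words` is made up for this example.

```python
docs = [
    "I like to play football",
    "It is a good game",
    "I prefer football over rugby",
]

# Build the vocabulary in order of first appearance (lowercased tokens)
vocab = []
for doc in docs:
    for word in doc.lower().split():
        if word not in vocab:
            vocab.append(word)

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    words = doc.lower().split()
    return [words.count(term) for term in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors[0])
```

In practice scikit-learn's `CountVectorizer` does the same job, with its own tokenization rules.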

#### Questions:
* are all words equally important?
* if not, which ones are more important (the frequent or the less frequent ones)?

#### TF-IDF

In the bag of words approach, each word has the same weight. The idea behind the TF-IDF approach is that words that occur rarely across all the documents but frequently in an individual document contribute more towards classification.

TF-IDF is a combination of two terms: term frequency (TF) and inverse document frequency (IDF). They can be calculated as:

TF  = (Frequency of a word in the document)/(Total words in the document)

IDF = Log((Total number of docs)/(Number of docs containing the word))

The TF-IDF weight of a word in a document is the product of the two: TF-IDF = TF * IDF.
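
As a worked example, the two formulas can be applied to the three football documents from the bag of words section. The sketch assumes the natural logarithm, whitespace tokenization, and that the word occurs in at least one document (otherwise the IDF would divide by zero).

```python
import math

docs = [
    "I like to play football",
    "It is a good game",
    "I prefer football over rugby",
]
tokenized = [doc.lower().split() for doc in docs]

def tf(word, doc_tokens):
    # Frequency of the word in the document / total words in the document
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, all_docs):
    # log(total number of docs / number of docs containing the word)
    containing = sum(1 for tokens in all_docs if word in tokens)
    return math.log(len(all_docs) / containing)

def tf_idf(word, doc_tokens, all_docs):
    return tf(word, doc_tokens) * idf(word, all_docs)

# "football" appears in two documents, "play" only in Doc1
football_score = tf_idf("football", tokenized[0], tokenized)
play_score = tf_idf("play", tokenized[0], tokenized)
print(round(football_score, 3), round(play_score, 3))
```

Both words occur once in Doc1, but "play" gets the higher weight because it is rarer across the collection.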

#### Scikit-learn has a tool that can vectorize the text:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [324]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [325]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
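
These numbers differ from what the plain TF and IDF formulas would give. With its default settings, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and then scales every row to unit Euclidean length. The pure-Python sketch below reproduces the matrix under those assumptions; the tokenization regex is an approximation of the vectorizer's default token pattern.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

def smoothed_idf(term):
    # ln((1 + n) / (1 + df)) + 1, where df = number of docs containing the term
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then scale the row to unit L2 norm
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print([round(v, 4) for v in rows[0]])
```

The first printed row should match the first row of the matrix above up to rounding.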


### Let's make it simpler by removing the stop words

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")
spacy_stopwords_en = spacy.lang.en.stop_words.STOP_WORDS

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
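# recent scikit-learn versions expect stop_words as a list (spaCy provides a set)
vectorizer = TfidfVectorizer(stop_words=list(spacy_stopwords_en))
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()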
print(X)

[[1.         0.        ]
 [0.78722298 0.61666846]
 [0.         0.        ]
 [1.         0.        ]]



In [14]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)
print(X)

['document', 'second']
(4, 2)
[[1.         0.        ]
 [0.78722298 0.61666846]
 [0.         0.        ]
 [1.         0.        ]]


In [6]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()
print(X)

[[0.62922751 0.77722116 0.         0.         0.        ]
 [0.78722298 0.         0.         0.61666846 0.        ]
 [0.         0.         0.70710678 0.         0.70710678]
 [0.62922751 0.77722116 0.         0.         0.        ]]


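# Where does the discrepancy come from?
* the two libraries ship different stop-word lists: judging by the outputs above, spaCy's list also removes `first`, `one` and `third` (only `document` and `second` survive), while NLTK's list keeps them (five columns)
* nltk works better for English
* spaCy (apparently) works better for Spanish

* but do not worry, they are both super easy to work with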
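
The effect of the stop-word list on the vocabulary can be illustrated without either library. The sketch below applies two short, hand-written stop-word lists to the same corpus. Both lists are illustrative stand-ins, not the actual spaCy or NLTK lists, and the helper name `vocabulary` is made up for this example.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Hand-written stand-ins for illustration only; the real lists are much longer
small_stop_list = {"this", "is", "the", "and"}
large_stop_list = small_stop_list | {"first", "one", "third"}

def vocabulary(docs, stop_words):
    """Return the sorted tokens that survive stop-word removal."""
    tokens = set()
    for doc in docs:
        tokens.update(re.findall(r"\b\w\w+\b", doc.lower()))
    return sorted(tokens - set(stop_words))

vocab_small = vocabulary(corpus, small_stop_list)
vocab_large = vocabulary(corpus, large_stop_list)
print(vocab_small)
print(vocab_large)
```

The larger list leaves only two features, which is the pattern seen in the spaCy run, while the smaller list leaves five, as in the NLTK run.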

# Apply all the steps to the dataset

In [None]:
import pandas as pd

data_source_url = "https://raw.githubusercontent.com/mhemmg/datasets/master/nlp/airline_tweets.csv"
airline_tweets = pd.read_csv(data_source_url)

#### extract features and labels

#### clean features (sentences in the list of features)

#### vectorize the sentences using `TfidfVectorizer`

#### Dividing Data into Training and Test Sets

#### Load the classifier (`RandomForestClassifier`) and train the model

#### Try the model on a sentence of your own (remember to vectorize it first using `vectorizer.transform(test_features).toarray()`)
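
To see how the steps fit together before running them on the real dataset, here is a minimal end-to-end sketch on a handful of made-up tweets. The toy `features` and `labels`, the helper `clean`, and the parameter values are assumptions for illustration; in the exercise, `features` and `labels` come from the corresponding columns of `airline_tweets`, and with so little data the accuracy is not meaningful.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the tweet texts and their sentiment labels
features = [
    "@airline great flight, thank you!",
    "@airline lost my bag again, terrible service",
    "@airline friendly crew and great service",
    "@airline delayed for three hours, terrible",
    "@airline thank you for the smooth flight",
    "@airline worst flight ever, lost my luggage",
    "@airline great crew, thank you",
    "@airline delayed again, worst service",
]
labels = ["positive", "negative"] * 4

def clean(text):
    # Replace special characters with spaces, collapse whitespace, lowercase
    text = re.sub(r"\W", " ", str(text))
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

processed_features = [clean(t) for t in features]

# Vectorize the cleaned sentences
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_features).toarray()

# Split into training and test sets, keeping both classes in each part
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# Train the classifier and check the accuracy on the held-out tweets
text_classifier = RandomForestClassifier(n_estimators=50, random_state=0)
text_classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, text_classifier.predict(X_test))
print(accuracy)

# Classify a new sentence: clean it, then transform it with the fitted vectorizer
test_features = [clean("thank you for the great service")]
prediction = text_classifier.predict(vectorizer.transform(test_features).toarray())
print(prediction)
```

Note that the new sentence goes through `vectorizer.transform`, not `fit_transform`, so that it is mapped onto the vocabulary learned from the training data.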