# Introduction to Natural Language Processing

## Tutorial 3

In this exercise, we will learn to extract features from text using Pandas and Scikit-Learn.

In [2]:
import pandas as pd
import numpy as np

## Pandas Series

Pandas has two data structures that we will use in this class: `Series` and `DataFrame`. Let's take a closer look at `Series` first. At first glance, a `Series` behaves much like a `list`, a data structure we already know. The difference is that Pandas lets us name the index entries, which makes the data much easier to read.

In [3]:
list1 = "This is the first document".split(" ")

In [4]:
print(list1)

['This', 'is', 'the', 'first', 'document']


In [9]:
my_series = pd.Series(list1)

In [None]:
my_series

In [None]:
my_dict = {"this":0, "is":1, "the":2, "first":3, "document":4}
my_index = [0, 1, 2, 3, 4]

In [None]:
series1 = pd.Series(my_dict)

In [None]:
series1

In [None]:
list2 = "this document is the second document".split(" ")
series2 = pd.Series(data=[0, 1, 2, 3, 4, 5], index = list2)

In [None]:
series2
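The following is a minimal sketch of what named indices make possible. The names `word_positions` and `repeated` are hypothetical and only used for illustration. It shows a lookup by label, a lookup by integer position, and what happens when a label appears more than once (as "document" does in `list2`).

```python
import pandas as pd

# Hypothetical example: store each word's position, indexed by the word itself
words = "This is the first document".split(" ")
word_positions = pd.Series(data=range(len(words)), index=words)

# Label-based access: look a value up by its index name
first_pos = word_positions["first"]

# Position-based access: look a value up by its integer position
third_value = word_positions.iloc[2]

# Index labels do not have to be unique: a repeated label returns all matches
repeated = pd.Series(
    data=[0, 1, 2, 3, 4, 5],
    index="this document is the second document".split(" "),
)
both_documents = repeated["document"]  # a Series with two rows

print(first_pos, third_value)
print(both_documents)
```

Because a repeated label returns a `Series` rather than a single value, it is worth checking whether an index is unique before relying on scalar lookups.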

## DataFrame

Now, let's look at the `DataFrame`. Since we already know the `Series` data structure, we can think of a DataFrame as several Series that share the same index. Here, we will use NumPy to create a random matrix, and we also set a fixed seed. Do you know why?

In [None]:
from numpy.random import randn
np.random.seed(123)

In [None]:
df1 = pd.DataFrame(randn(5,4), index=[0, 1, 2, 3, 4], columns="A B C D".split(" "))

In [None]:
display(df1)
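One way to explore the question about the seed is to draw the random matrix more than once. The sketch below uses hypothetical variable names (`first_draw`, `second_draw`, `third_draw`) and assumes the same legacy `np.random.seed` interface used above.

```python
import numpy as np
from numpy.random import randn

# Seed, then draw a 5x4 matrix
np.random.seed(123)
first_draw = randn(5, 4)

# Reset the same seed and draw again: the numbers repeat
np.random.seed(123)
second_draw = randn(5, 4)

# Draw once more without reseeding: the generator has moved on
third_draw = randn(5, 4)

print(np.array_equal(first_draw, second_draw))
print(np.array_equal(first_draw, third_draw))
```

Fixing the seed makes the "random" values reproducible, so everyone running the notebook sees the same DataFrame.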

### Indexing DataFrames

Here things start to differ a bit. To select a single column, we index with its name; to select several columns, we pass their names as a list.

In [None]:
df1["B"]

In [None]:
df1[["C", "D"]]

In [None]:
df1["E"] = df1["A"] * df1["D"]

In [None]:
df1

In [None]:
df1.drop("E", axis=1, inplace=True)

In [None]:
df1

### loc vs iloc

Pandas gives us the possibility to locate elements either by integer position or by label: `iloc` takes integer positions and `loc` takes labels (index and column names). Note that both are indexed like a matrix, `[[rows], [columns]]`, with rows first and columns second.

In [None]:
df1.iloc[[0,1,3],[1]]

In [None]:
df1.loc[0]

In [None]:
df1
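With an integer index like the one in `df1`, labels and positions happen to coincide, which can hide the difference between the two accessors. The sketch below uses a small hypothetical frame `df_demo` with string labels to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of 0, 1, 2
df_demo = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1, 2, 3]},
    index=["x", "y", "z"],
)

# loc: row label "y", column label "B"
by_label = df_demo.loc["y", "B"]

# iloc: second row, second column (positions start at 0)
by_position = df_demo.iloc[1, 1]

# Slices behave differently: loc includes the end label,
# iloc excludes the end position (like regular Python slicing)
label_slice = df_demo.loc["x":"y", "A"]
position_slice = df_demo.iloc[0:1, 0]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```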

### Apply

Pandas lets us apply a function to every element of a `Series`, which can be very useful. Let's take a look.

In [None]:
corpus = ["This is the first document",
           "This document is the second document",
           "And this is the third one", 
           "Is this the first document"]

In [None]:
df2 = pd.DataFrame(corpus, columns=["text"])

In [None]:
df2

In [None]:
def count_words(any_string):
    return len(any_string.split(" "))

In [None]:
# Add a new column to DF >> "count_words"
# Your code comes here
df2["count_words"] = df2["text"].apply(count_words)

In [None]:
df2
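As a side note, the same count can often be obtained with the vectorized `.str` accessor instead of `apply`. The sketch below compares the two on a hypothetical series `texts`. The second entry contains repeated spaces to show that `split(" ")` and a whitespace split do not always agree.

```python
import pandas as pd

def count_words(any_string):
    # Same helper as above: splits on every single space character
    return len(any_string.split(" "))

# Hypothetical data; the second text is built with 2 and 3 consecutive spaces
texts = pd.Series([
    "This is the first document",
    "And" + " " * 2 + "this" + " " * 3 + "one",
])

# Element-wise function call
with_apply = texts.apply(count_words)

# Vectorized alternative: split on any run of whitespace, then count the pieces
with_str = texts.str.split().str.len()

print(with_apply.tolist())
print(with_str.tolist())
```

On clean, single-spaced text such as `corpus`, both approaches give the same result. They differ only when a string contains consecutive spaces, because `split(" ")` then produces empty strings that are counted as words.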

### Visualization

In [None]:
# Bar plot
# Your code comes here
df2.plot.bar(x="text", y="count_words")

### Scikit-Learn: Understanding CountVectorizer

The CountVectorizer is specifically used for counting words. The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the text
text = """The CountVectorizer is specifically used for counting words.
The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y
thing that computers can understand."""

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
matrix

In [None]:
matrix.toarray()

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())

In [None]:
counts

In [None]:
# Sort the DF and show the top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)
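To build some intuition for what happens inside `fit_transform`, here is a rough pure-Python sketch of the counting step. It is not the library's implementation. It only mimics two documented defaults of `CountVectorizer`: lowercasing the input, and tokenizing with the pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters. The function name `simple_count_vectorize` is hypothetical.

```python
import re
from collections import Counter

def simple_count_vectorize(document):
    """Count tokens roughly the way CountVectorizer's defaults would."""
    # Lowercase, then keep tokens made of at least two word characters
    tokens = re.findall(r"(?u)\b\w\w+\b", document.lower())
    return Counter(tokens)

sample = (
    "The CountVectorizer is specifically used for counting words. "
    "The vectorizer part of CountVectorizer is a process."
)

counts_sketch = simple_count_vectorize(sample)

# The vocabulary is sorted alphabetically, one column per word
vocabulary = sorted(counts_sketch)
row = [counts_sketch[word] for word in vocabulary]

print(vocabulary)
print(row)
```

Note that the single-letter word "a" does not appear in the vocabulary, and that "The" and "the" are counted together because of the lowercasing step.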

In [None]:
import requests

# Download the book
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text

# Look at some text in the middle
print(text[4101:4600])

In [None]:
# How often have the words "love" and "hate" been used in the book?
#your code comes here
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())

In [None]:
# Each word is a column; the single row (index 0) is the whole book
print(counts['love'])
print(counts['hate'])