# Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

# Table of Contents
1. [Representing text as numerical data](#rtand)
2. [Reading a text-based dataset into pandas](#reading)
3. [Vectorizing our dataset](#vectorzing)
4. [Building and evaluating a model](#eval)
5. [Comparing models](#comp)
6. [Examining a model for further insight](#exam)
7. [Practicing this workflow on another dataset](#pract)
8. Tuning the vectorizer (discussion)

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Ignoring warnings
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# sns.set_style("whitegrid")
# plt.style.use("fivethirtyeight")

### Representing text as numerical data <a class = 'anchor' id = 'rtand'></a>


In [12]:
simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']

In [13]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
vect = CountVectorizer()
vect

CountVectorizer()

In [15]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

CountVectorizer()

In [16]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)
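The fitted vocabulary is all lowercase and does not contain `a`, even though `a` appears in the second document. The sketch below shows why, assuming the default `CountVectorizer` settings (`lowercase=True` and a token pattern that keeps only tokens of two or more word characters). It uses `build_analyzer()`, which returns the callable the vectorizer applies to each document; the variable names `analyzer` and `tokens` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']

# build_analyzer() returns the preprocessing + tokenizing callable
# that CountVectorizer applies to every document (default settings here)
analyzer = CountVectorizer().build_analyzer()

# tokenize each training document the same way fit() would
tokens = [analyzer(doc) for doc in simple_train]

for doc, toks in zip(simple_train, tokens):
    print(repr(doc), '->', toks)
```

Punctuation is discarded, case is folded, and the single-character token `a` is dropped, which leaves the six distinct terms listed above.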

In [17]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
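The summary above reports only the shape and the number of stored (nonzero) entries. As a rough sketch, the same facts can be read from the matrix attributes `shape` and `nnz`. The snippet rebuilds the matrix with `fit_transform`, which combines the `fit` and `transform` steps; the name `density` is introduced here purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']

# fit_transform() learns the vocabulary and builds the document-term matrix in one call
dtm = CountVectorizer().fit_transform(simple_train)

n_docs, n_terms = dtm.shape   # rows are documents, columns are vocabulary terms
stored = dtm.nnz              # number of explicitly stored (nonzero) entries
density = stored / (n_docs * n_terms)

print(f"{n_docs} documents x {n_terms} terms, {stored} stored entries, density {density:.2f}")
```

On a toy corpus half of the cells are nonzero, but on a realistic corpus most documents use only a small fraction of the vocabulary, which is why a sparse format is used.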

In [18]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [19]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

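|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |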
