# 1. Working With Word Vectors
In the introduction section we went over some very basic NLP techniques and, more importantly, gained an understanding of how to apply _basic_ machine learning models and APIs to a problem revolving around text data. 

One of the things that we went over was the **term-document matrix**, and how it can be used to convert text data into numerical data that a model can work with. What we did not discuss is that this method has a problem, namely its simple counting process. Words such as "a", "the", "and", "in", and "to" have a high count in ALL documents, no matter what the category is! This is a large amount of noise that will often overshadow the meaningful words. 

These words are known as **stopwords**, and one common technique is to simply remove them from the dataset before doing any machine learning. 
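
To make the noise problem concrete, here is a minimal sketch of counting words with and without stopword removal. The `STOPWORDS` set and the `count_tokens` helper are illustrative assumptions, not part of any library; real stopword lists are much longer.

```python
from collections import Counter

# A tiny, illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "the", "and", "in", "to", "of", "is"}

def count_tokens(text, remove_stopwords=False):
    """Lowercase the text, split on whitespace, and count the tokens."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

doc = "the cat sat in the hat and the dog sat in the sun"

raw_counts = count_tokens(doc)
filtered_counts = count_tokens(doc, remove_stopwords=True)

print(raw_counts.most_common(1))       # the most frequent token is a stopword
print(filtered_counts.most_common(1))  # after filtering, a content word leads
```

In the raw counts, "the" dominates even though it says nothing about the topic; once the stopwords are filtered out, the most frequent token is a word that actually carries meaning.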

### 1.1 TF-IDF
However, there is another technique that we can utilize: **Term Frequency-Inverse Document Frequency** (TF-IDF). We will not go into the full details of TF-IDF, but the gist is as follows: words that appear in many documents are probably less meaningful. With this in mind, we can weight each vector component (in this case a word) by something related to how many documents that word appears in. Intuitively speaking, we may do something like:

$$\frac{\text{raw word count}}{\text{document count}}$$ 

The numerator tells us _how many times this word appears in this document_, and the denominator tells us _how many documents this word appears in, in total_. In practice we apply some transformations to these, like taking logs, smoothing, and so on. However, the specific implementation isn't nearly as important for this course as the general understanding behind the process. 
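
As a rough sketch of this idea, the snippet below computes both the intuitive ratio and one common log-based variant on a toy corpus. The corpus and the helper names `naive_weight` and `tfidf` are assumptions made for illustration, and the log formula shown is only one of several variants used in practice (libraries typically add smoothing and normalization on top).

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]

# document_count[w] = number of documents that contain the word w.
document_count = Counter()
for tokens in tokenized:
    document_count.update(set(tokens))

def naive_weight(word, tokens):
    """The intuitive ratio: raw word count / document count."""
    return tokens.count(word) / document_count[word]

def tfidf(word, tokens):
    """One common variant: term frequency times log inverse document frequency."""
    idf = math.log(len(tokenized) / document_count[word])
    return tokens.count(word) * idf

first_doc = tokenized[0]
for word in ("the", "cat", "mat"):
    print(word, naive_weight(word, first_doc), tfidf(word, first_doc))
```

Both helpers are only defined for words that appear somewhere in the corpus. Note that "the" appears in every document, so its log-based weight is exactly zero, while "mat", which appears in only one document, receives the largest weight of the three.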

### 1.2 Key Point
One of the most important things to keep in mind during all subsequent posts is that no matter what technique we are using, we are always interested in a matrix of size $(V \times D)$, where $V$ is the vocabulary size (the total number of words) and $D$ is the vector dimensionality. For example, if we are counting up the total number of times each word appears in each of a set of books, $D$ is the total number of books. 
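
A minimal sketch of such a matrix, using plain Python lists and three very short stand-in "books" (the corpus and variable names are assumptions for illustration):

```python
# Three tiny stand-in "books" (an assumption for illustration).
books = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({word for book in books for word in book.split()})
word_index = {word: i for i, word in enumerate(vocab)}

V, D = len(vocab), len(books)

# matrix[i][j] = number of times word i appears in book j.
matrix = [[0] * D for _ in range(V)]
for j, book in enumerate(books):
    for word in book.split():
        matrix[word_index[word]][j] += 1

print(V, D)                       # 10 3
print(matrix[word_index["the"]])  # [2, 2, 1]
print(matrix[word_index["cat"]])  # [1, 1, 0]
```

Each row is the $D$-dimensional vector for one word, and there are $V$ rows in total.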

### 1.3 Word Embeddings
A final thing to note: we are going to encounter the term **word embedding** quite a bit. This is just a fancy name for an old and relatively straightforward concept: a feature vector that represents a word. In other words, we can take a categorical object (a word in this case) and map it to a list of numbers (a vector). We say that we have embedded this word into a vector space, and that is why we call them word embeddings. 
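
The mapping from a word to its vector can be sketched as a simple row lookup. The vocabulary, the numbers in `embedding_matrix`, and the `embed` helper below are all made up for illustration; real embeddings are learned or computed from data.

```python
# Hypothetical vocabulary and 3-dimensional embeddings; the numbers are made up.
vocab = ["cat", "dog", "market"]
word_index = {word: i for i, word in enumerate(vocab)}

# One row per word: a (V x D) matrix with V = 3 words and D = 3 dimensions.
embedding_matrix = [
    [0.9, 0.1, 0.0],  # cat
    [0.8, 0.2, 0.1],  # dog
    [0.0, 0.1, 0.9],  # market
]

def embed(word):
    """Map a word (a categorical object) to its feature vector."""
    return embedding_matrix[word_index[word]]

cat_vector = embed("cat")
print(cat_vector)  # [0.9, 0.1, 0.0]
```

The word itself is just a label; the row it indexes is the list of numbers a model actually works with.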