# Session 1 - Data representation from text

In this session, we will learn the basics of data representation. You will discover several techniques for turning a document into something a model can process, from the basic count vectorizer to the more elaborate TF-IDF, and learn how they differ and which use cases each one suits best. We will also cover some data cleaning approaches so that noise and unnecessary information do not end up in our observations.

Keywords: *tokenization*, *stop words*, *feature representation*, *count vectorizer*, *tfidf*, *regex*, *feature selection*

## Overview of the project and the data
This stage of the project focuses on building a classification model that assigns a company to its industry using the company description scraped online. We will explore how to represent this information so that a machine can process it, and look for the method that gives the best accuracy. During this stage, you will develop an independent pipeline that takes text as input and returns the industry closest to that input.

Let's first look at the dataset used throughout this course. We load it with a popular library called [Pandas](https://www.educba.com/what-is-pandas/):

In [13]:
#Let's import the libraries
import pandas as pd #We define an alias for future usage of the library
import re #We will use regex to clean our text
from collections import Counter

In [None]:
#We will import and read our dataset using pandas
dataset = pd.read_csv("")

In [None]:
#Let's now look at a sample of the data
dataset.head()

There are **2** features in this dataset:
- The name of the company
- The description of the company. We assume this text has already been through a first cleaning pass, so the most obvious unnecessary information has been removed. If you want to learn more about this cleaning step, refer to stage 1 of this course.

Even so, some unnecessary information remains in this text. To remove it, we filter and clean the text using regular expressions. A regular expression (regex for short) is a string that defines a search pattern, which helps you match, locate, and manage text.

Here is an example:

![Regular expression](https://www.computerhope.com/jargon/r/regular-expression.gif)
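
As a minimal sketch of how such patterns behave in Python, the snippet below runs three patterns from the standard `re` module over a short made-up sentence. The sentence and the variable names are assumptions chosen for illustration; they are not part of the dataset.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Apple was founded in 1976.[15] Revenue: $274.5 billion!"

# \d+ matches one or more consecutive digits
numbers = re.findall(r"\d+", sample)

# \[\d+\] matches a citation marker such as [15]
citations = re.findall(r"\[\d+\]", sample)

# [A-Za-z]+ matches runs of ASCII letters, i.e. the words
words = re.findall(r"[A-Za-z]+", sample)

print(numbers)
print(citations)
print(words)
```

Note that `\d+` splits `274.5` into two matches, because the dot is not a digit.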

In our case, we want to drop numbers and special characters, because they can appear in descriptions from any industry and do not seem to be a strong marker of a particular one. Let's look at a simple example with the description of Apple:

In [3]:
description = '''
Apple Inc. is an American multinational technology company that specializes in consumer electronics, computer software, and online services. Apple is the world's largest technology company by revenue (totaling $274.5 billion in 2020) and, since January 2021, the world's most valuable company. As of 2021, Apple is the world's fourth-largest PC vendor by unit sales,[9] and fourth-largest smartphone manufacturer.[10][11] It is one of the Big Five American information technology companies, along with Amazon, Google, Microsoft, and Facebook.[12][13][14]

Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976 to develop and sell Wozniak's Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977, and sales of its computers, including the Apple II, grew quickly. They went public in 1980 to instant financial success. Over the next few years, Apple shipped new computers featuring innovative graphical user interfaces, such as the original Macintosh, announced with the critically acclaimed advert "1984". However, the high price of its products and limited application library caused problems, as did power struggles between executives. In 1985, Wozniak departed Apple amicably,[15] while Jobs resigned to found NeXT, taking some Apple co-workers with him.[16]

As the market for personal computers expanded and evolved through the 1990s, Apple lost considerable market share to the lower-priced duopoly of Microsoft Windows on Intel PC clones. The board recruited CEO Gil Amelio, who prepared the struggling company for eventual success with extensive reforms, product focus and layoffs in his 500 day tenure. In 1997, Gil bought NeXT, to resolve Apple's unsuccessful OS strategy and bring back Steve Jobs, who replaced Amelio as CEO later that year. Apple returned to profitability under the revitalizing "Think different" campaign, launching the iMac and iPod, opening a retail chain of Apple Stores in 2001, and acquiring numerous companies to broaden their software portfolio. In 2007, the company launched the iPhone to critical acclaim and financial success. In 2011, Jobs resigned as CEO due to health complications, and died two months later. He was succeeded by Tim Cook.

In August 2018, Apple became the first publicly traded U.S. company to be valued at over $1 trillion[17][18] and the first valued over $2 trillion two years later.[19][20] It has a high level of brand loyalty and is ranked as the world's most valuable brand; as of January 2021, there are 1.65 billion Apple products in use worldwide.[21] However, the company receives significant criticism regarding the labor practices of its contractors, its environmental practices, and business ethics, including anti-competitive behavior, and materials sourcing. 
'''
description

'\nApple Inc. is an American multinational technology company that specializes in consumer electronics, computer software, and online services. Apple is the world\'s largest technology company by revenue (totaling $274.5 billion in 2020) and, since January 2021, the world\'s most valuable company. As of 2021, Apple is the world\'s fourth-largest PC vendor by unit sales,[9] and fourth-largest smartphone manufacturer.[10][11] It is one of the Big Five American information technology companies, along with Amazon, Google, Microsoft, and Facebook.[12][13][14]\n\nApple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976 to develop and sell Wozniak\'s Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977, and sales of its computers, including the Apple II, grew quickly. They went public in 1980 to instant financial success. Over the next few years, Apple shipped new computers featuring innovative graphical user interfaces, such as th

In [8]:
description = re.sub("[^A-Za-z]+", " ", description) #Replace every run of non-letter characters with a single space
description

' Apple Inc is an American multinational technology company that specializes in consumer electronics computer software and online services Apple is the world s largest technology company by revenue totaling billion in and since January the world s most valuable company As of Apple is the world s fourth largest PC vendor by unit sales and fourth largest smartphone manufacturer It is one of the Big Five American information technology companies along with Amazon Google Microsoft and Facebook Apple was founded by Steve Jobs Steve Wozniak and Ronald Wayne in to develop and sell Wozniak s Apple I personal computer It was incorporated by Jobs and Wozniak as Apple Computer Inc in and sales of its computers including the Apple II grew quickly They went public in to instant financial success Over the next few years Apple shipped new computers featuring innovative graphical user interfaces such as the original Macintosh announced with the critically acclaimed advert However the high price of its
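
The same substitution can be wrapped in a small helper so that it can be reused on every description. This is only a sketch: the function name `clean_text` and the sample sentence are assumptions made for illustration.

```python
import re

def clean_text(text):
    """Keep ASCII letters only and trim the surrounding whitespace."""
    # Replace every run of non-letter characters with one space
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    # Remove the leading/trailing space left behind by the substitution
    return letters_only.strip()

# Hypothetical input, shortened from the kind of text shown above
cleaned = clean_text("\nApple Inc. was valued at over $1 trillion[17][18] in 2018.\n")
print(cleaned)
```

One side effect to keep in mind: apostrophes are removed too, so `world's` becomes `world s`, as in the output above.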

## How do you represent text for machine learning?
We need to transform the text to make it understandable by a machine learning algorithm.

- We will start by converting the text into a sequence of words, also known as tokens:

In [49]:
review_1 = "I really liked this Movie"
review_2 = "We did not like this movie"
review_3 = "i used to like these action movies"


review_1.split(" ")


['I', 'really', 'liked', 'this', 'Movie']
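
Splitting on spaces works here because the reviews contain no punctuation. As a hedged alternative, the sketch below tokenizes with a regular expression instead. The helper name `tokenize` and the sentence with an exclamation mark are assumptions added for illustration.

```python
import re

def tokenize(text):
    """Return the runs of ASCII letters found in the text."""
    return re.findall(r"[A-Za-z]+", text)

# Hypothetical variant of the first review, with punctuation added
sentence = "I really liked this Movie!"

split_tokens = sentence.split(" ")   # punctuation stays attached to the last token
regex_tokens = tokenize(sentence)    # only the words are returned

print(split_tokens)
print(regex_tokens)
```

With `split(" ")`, `Movie!` and `Movie` would count as two different tokens; the regex version avoids that.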

- A machine learning algorithm cannot use the text as it is, because the number of distinct words it would have to learn from is unbounded. That is why we create a custom dictionary (a vocabulary) that stores each word, and then transform each sentence into a vector that counts its tokens. To create that dictionary, we can use a set, which stores each new word as soon as the system reads it:

In [50]:
vocabulary = set()
vocabulary.update(review_1.split(" "))
vocabulary


{'I', 'Movie', 'liked', 'really', 'this'}

Here, you can see that we updated our vocabulary with the new tokens from the first review. Let's keep updating it with the new tokens from the other reviews.

In [51]:
vocabulary.update(review_2.split(" "))
vocabulary.update(review_3.split(" "))
print(f"The size of the vocabulary is {len(vocabulary)} tokens.")
print(f"Here are the tokens that compose it: {vocabulary}")

The size of the vocabulary is 16 tokens.
Here are the tokens that compose it: {'action', 'I', 'these', 'We', 'movie', 'really', 'movies', 'Movie', 'did', 'used', 'like', 'not', 'i', 'liked', 'this', 'to'}


Here, you can see that each new token the system observes is added to the vocabulary.
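
Once the vocabulary exists, each review can be turned into a vector of token counts. The sketch below is one possible way to do it with `Counter`. The helper `to_count_vector` and the choice to sort the vocabulary (so that every token gets a stable position) are assumptions, not something the session prescribes.

```python
from collections import Counter

reviews = [
    "I really liked this Movie",
    "We did not like this movie",
    "i used to like these action movies",
]

# Build the vocabulary, then sort it so every token gets a stable index
vocabulary = set()
for review in reviews:
    vocabulary.update(review.split(" "))
index = {token: position for position, token in enumerate(sorted(vocabulary))}

def to_count_vector(text, index):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.split(" "))
    vector = [0] * len(index)
    for token, count in counts.items():
        if token in index:  # tokens outside the vocabulary are ignored
            vector[index[token]] = count
    return vector

vector = to_count_vector(reviews[0], index)
print(sorted(vocabulary))
print(vector)
```

Every review is now a list of the same length as the vocabulary, which is the fixed-size input a model expects.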

## Why do we need to clean our text?

When creating our vocabulary, we can notice that many words with the same meaning have been added as distinct entities, even though they carry the same information.

Let's try to reduce the vocabulary so words of the same meaning can be represented by the same index in our dictionary.

- First, let's look at capital letters: Movie and movie are considered distinct pieces of information even though they mean the same thing.

In [52]:
# For simplicity, let's group the reviews in a list
reviews = [review_1, review_2, review_3]
reviews

['I really liked this Movie',
 'We did not like this movie',
 'i used to like these action movies']

In [53]:
reviews = [review.lower() for review in reviews]
reviews

['i really liked this movie',
 'we did not like this movie',
 'i used to like these action movies']

In [54]:
#Let's create a new vocabulary with these reviews
vocabulary = set()
for review in reviews:
    vocabulary.update(review.split(" "))
print(f"The size of the vocabulary is {len(vocabulary)} tokens.")
print(f"Here are the tokens that compose it: {vocabulary}")

The size of the vocabulary is 14 tokens.
Here are the tokens that compose it: {'action', 'these', 'we', 'movie', 'really', 'movies', 'did', 'used', 'like', 'not', 'i', 'liked', 'this', 'to'}


As you can see, converting all the reviews to lower case reduced our vocabulary size from 16 to 14 tokens, which makes the vocabulary more compact for the model we will be building.
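
To make the effect of lowercasing easier to inspect, the comparison can be sketched with a small helper. The function `build_vocabulary` and its `lowercase` flag are hypothetical names introduced here for illustration.

```python
def build_vocabulary(texts, lowercase=False):
    """Collect the distinct space-separated tokens of all texts."""
    vocabulary = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        vocabulary.update(text.split(" "))
    return vocabulary

raw_reviews = [
    "I really liked this Movie",
    "We did not like this movie",
    "i used to like these action movies",
]

raw_vocabulary = build_vocabulary(raw_reviews)
lower_vocabulary = build_vocabulary(raw_reviews, lowercase=True)

# Capitalised tokens that disappear once the text is lowercased
capitalised_only = raw_vocabulary - lower_vocabulary

print(len(raw_vocabulary), len(lower_vocabulary))
print(sorted(capitalised_only))
```

Three capitalised tokens disappear, but the vocabulary shrinks by only two: `I` and `Movie` merge with the existing `i` and `movie`, while `We` is simply replaced by `we`.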

## Which other transformations can be applied to the data to optimize the vocabulary?

- Let's 

In [None]:
## Transforming text into features