# Part 2: Text Mining

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = [
    'Now for manners use has company believe parlors.',
    'Least nor party who wrote while did. Excuse formed as is agreed admire so on result parish.',
    'Put use set uncommonly announcing and travelling. Allowance sweetness direction to as necessary.',
    'Principle oh explained excellent do my suspected conveying in.',
    'Excellent you did therefore perfectly supposing described.',
    'Its had resolving otherwise she contented therefore.',
    'Afford relied warmth out sir hearts sister use garden.',
    'Men day warmth formed admire former simple.',
    'Humanity declared vicinity continue supplied no an. He hastened am no property exercise of.',
    'Dissimilar comparison no terminated devonshire no literature on. Say most yet head room such just easy.',
]

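# Build the count matrix: one row per document, one column per vocabulary term.
# fit_transform learns the vocabulary and returns a sparse matrix of raw term counts.
count_vectorizer = CountVectorizer()
count_vector = count_vectorizer.fit_transform(data)
print("Count Vector:\n", count_vector.toarray())

# Build the TF-IDF matrix over the same documents.
# Each count is reweighted by how rare the term is across the corpus.
tfidf_vectorizer = TfidfVectorizer()
tfidf_vector = tfidf_vectorizer.fit_transform(data)
print("\nTF-IDF Vector:\n", tfidf_vector.toarray())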

Count Vector:
 [[0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
  0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0]
 [0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 

## The usage of TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme that scores how important a word (or term) is to a document within a corpus. A word is considered important to a document when it appears frequently in that document but in few other documents of the corpus. The weight consists of two parts:

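Term Frequency (TF): The number of times the word occurs in the document. Words repeated often within a document have a high TF. Very common words such as "the" therefore score high in almost every document, which is why TF alone is a poor measure of importance.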

Inverse Document Frequency (IDF): A downscaling factor based on how many documents in the corpus contain the word. Words appearing in many documents have a lower IDF.
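
In symbols, one common textbook formulation of the two parts is given below. The notation ($f_{t,d}$, $N$, $\mathrm{df}(t)$) is introduced here for illustration and is not fixed by this section:

$$
\mathrm{tf}(t, d) = f_{t,d}, \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$. Note that `TfidfVectorizer` with its default settings uses a smoothed variant, $\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1$, and then scales each document row to unit L2 norm, so the values it prints differ from the plain textbook form.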

Multiplying TF by IDF gives the TF-IDF weight. The product highlights words that are characteristic of a particular document and downplays words that are common across the whole corpus.
TF-IDF vectors are commonly used as features for machine learning on text because they reflect how informative a word is, not just how often it occurs.
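
To make the weighting concrete, here is a minimal pure-Python sketch of the computation. It is an illustration under stated assumptions, not the library's implementation: the helper name `tfidf`, the whitespace tokenizer, and the three toy documents are all hypothetical. It uses the smoothed IDF and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Hypothetical helper for illustration only. Tokenization is a plain
    lowercase whitespace split, which is simpler than scikit-learn's tokenizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts for this document
        weights = [tf[t] * idf[t] for t in vocab]
        # Scale the row to unit L2 norm so document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the cat ran home"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[2]):
    print(f"{term:>5}: {weight:.3f}")
```

In the third toy document, "the" occurs in every document and so receives the smallest non-zero weight, "cat" occurs in two documents and is weighted higher, and "ran" and "home" occur only in that document and are weighted highest.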