### CountVectorizer
The purpose of `CountVectorizer` from the `sklearn.feature_extraction.text` module is to convert a collection of text documents into a matrix of token counts (also known as a "document-term matrix"). In simple terms, it transforms text data into a numerical format that machine learning algorithms can understand.

Here’s a breakdown of its key functionality:

1. **Tokenization**: It splits text into individual tokens (usually words), which are used as features for further analysis. By default it lowercases the text and keeps only tokens of two or more alphanumeric characters.
2. **Vectorization**: It converts the tokens into numeric vectors. Each document is represented as a vector that records how many times each vocabulary term occurs in that document.
3. **Feature Extraction**: The tokens are treated as features, allowing text to be used in machine learning models, such as classifiers.
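
The three steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper names `tokenize` and `count_vectorize` are hypothetical, and the regular expression is assumed to mirror the library's default token pattern (tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Tokenization: lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vectorize(documents):
    """Return (vocabulary, matrix) with one row of token counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Feature extraction: every distinct token becomes one column.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Vectorization: count each vocabulary term in each document.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that "the" appears twice in the second document, so its count in that row is 2.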

For example, if you have two documents:
- "I love programming"
- "I love machine learning"

`CountVectorizer` with its default settings will:
- Create a vocabulary: ['learning', 'love', 'machine', 'programming']. The text is lowercased, single-character tokens such as "I" are dropped by the default token pattern, and the terms are ordered alphabetically.
- Create a document-term matrix (each row represents a document, and each column represents a word):
  
| Document  | learning | love | machine | programming |
|-----------|----------|------|---------|-------------|
| Doc 1     | 0        | 1    | 0       | 1           |
| Doc 2     | 1        | 1    | 1       | 0           |

This matrix shows how frequently each word appears in each document.

##### Common Parameters:
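- `max_features`: Keeps only the given number of terms, chosen by highest total count across all documents.
- `stop_words`: Ignores common words like "the", "and", etc., which may not be useful for analysis. It accepts the string `'english'` for a built-in list, or a custom list of words.
- `ngram_range`: A `(min_n, max_n)` tuple that allows the use of n-grams (pairs or triples of adjacent words) rather than just single words. For example, `(1, 2)` keeps single words and word pairs.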

In summary, `CountVectorizer` is a simple but effective tool for transforming text data into a format suitable for machine learning tasks like text classification or clustering.

### Cosine similarity
Cosine similarity is a metric used to measure how similar two vectors (usually in a multi-dimensional space) are to each other. It's commonly used in fields like natural language processing, information retrieval, and machine learning to compare documents or texts. 

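The basic idea behind cosine similarity is to calculate the cosine of the angle between two vectors, so the result depends on their direction and not on their length. If the vectors point in nearly the same direction, the cosine of the angle between them will be close to 1, meaning the vectors are very similar. If they are at 90 degrees (orthogonal), the cosine will be 0, meaning there is no similarity. If they point in exactly opposite directions, the cosine will be -1. Count vectors such as those produced by `CountVectorizer` have no negative entries, so their cosine similarity always falls between 0 and 1.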
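
The cosine of the angle is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch of that formula follows; the function name and the sample vectors are illustrative, and the last pair is the count vectors of "I love programming" and "I love machine learning" under `CountVectorizer`'s default settings.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])    # different lengths, same direction
orthogonal = cosine_similarity([1, 0], [0, 1])        # 90 degrees apart
opposite = cosine_similarity([1, 2], [-1, -2])        # opposite directions
doc_similarity = cosine_similarity([0, 1, 0, 1], [1, 1, 1, 0])

print(round(same_direction, 3))  # 1.0
print(round(orthogonal, 3))      # 0.0
print(round(opposite, 3))        # -1.0
print(round(doc_similarity, 3))  # 0.408
```

The two documents share only the word "love", which is why their similarity is well below 1.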

For a fuller explanation of cosine similarity, see this video: https://www.youtube.com/watch?v=e9U0QAFbfLI

![image.png](attachment:image.png)

A **sparse matrix** is a matrix in which most of the elements are zero. In other words, the number of non-zero elements is significantly smaller than the total number of elements in the matrix. Sparse matrices are often encountered in areas like scientific computing, machine learning, and graph theory, where they provide efficient storage and computation, as storing all elements of a sparse matrix would be wasteful.

### Example:

Consider the following 5x5 matrix:

\[
\begin{bmatrix}
0 & 0 & 0 & 0 & 3 \\
0 & 0 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 5 & 0 & 0
\end{bmatrix}
\]

This matrix has 25 elements, and of these, only 3 elements are non-zero (3, 4, and 5). Therefore, it is a sparse matrix. Instead of storing all 25 elements, we can store just the non-zero ones (along with their row and column indices) in a more compact form. This reduces memory usage.

### Compact Storage:
Two common ways to store this sparse matrix efficiently are the **coordinate list (COO)** format and the **compressed sparse row (CSR)** format, both of which store only the non-zero values and their positions.

For example, using the COO format, the matrix can be represented as:

- Non-zero values: `[3, 4, 5]`
- Row indices: `[0, 2, 4]`
- Column indices: `[4, 1, 2]`

Here the COO form holds 9 numbers (3 values plus 6 indices) instead of 25. The saving is modest for a matrix this small, but it grows with the size and sparsity of the matrix.
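
The conversion can be sketched in plain Python for the 5x5 example. The helper names `to_coo`, `row_pointers`, and `coo_to_dense` are hypothetical; in practice a library such as SciPy provides these formats. The CSR part assumes the COO entries are listed row by row, in which case CSR keeps the same values and column indices and replaces the row indices with one pointer per row boundary.

```python
dense = [
    [0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0],
]


def to_coo(matrix):
    """Collect the non-zero values with their row and column indices."""
    values, rows, cols = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                values.append(value)
                rows.append(i)
                cols.append(j)
    return values, rows, cols


def row_pointers(rows, n_rows):
    """CSR row pointers: row i owns entries indptr[i] up to indptr[i + 1]."""
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1          # count the entries in each row
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]  # running total gives the boundaries
    return indptr


def coo_to_dense(values, rows, cols, shape):
    """Rebuild the full matrix to confirm that nothing was lost."""
    n_rows, n_cols = shape
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for value, i, j in zip(values, rows, cols):
        matrix[i][j] = value
    return matrix


values, rows, cols = to_coo(dense)
indptr = row_pointers(rows, len(dense))

print(values, rows, cols)  # [3, 4, 5] [0, 2, 4] [4, 1, 2]
print(indptr)              # [0, 1, 1, 2, 2, 3]
print(coo_to_dense(values, rows, cols, (5, 5)) == dense)  # True
```

Repeated values in `indptr` (such as the two 1s) mark empty rows, which is how CSR avoids storing a row index for every entry.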