# Demo 19

In [None]:
import pandas as pd
import sklearn
import numpy as np

import matplotlib.pyplot as plt

## Dataset - Obits from HW02

Now let's look at using k-means to cluster documents.

Load in data. This takes a little while.

In [None]:
df = pd.read_csv("data/tfidf_hw02.csv.gz", compression="gzip")
df.shape

In [None]:
df.index

In [None]:
df.head(5)

In [None]:
df.index = df['subject']
df

In [None]:
df = df.drop(columns=['subject'])
df

Let's store the dataframe's values in a new NumPy array called X

In [None]:
X = df.to_numpy()
X.shape

In [None]:
X[-1], df.index[-1]

### Sparsity

In [None]:


values, counts = np.unique(X, return_counts=True)
counts, values

Most common value is 0
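
A single number can summarize how sparse the matrix is. The sketch below uses a small made-up matrix (`X_toy` is a hypothetical stand-in for `X`) and computes the fraction of entries that are exactly zero; the same expression should work on the real `X`.

```python
import numpy as np

# Toy stand-in for the tf-idf matrix (3 documents, 4 terms); values are made up.
X_toy = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
    [0.3, 0.0, 0.0, 0.0],
])

# Fraction of entries that are exactly zero.
sparsity = 1.0 - np.count_nonzero(X_toy) / X_toy.size
print(f"{sparsity:.0%} of the entries are zero")  # 75% of the entries are zero
```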

### Size

In [None]:
X.shape

In [None]:
!ls -lah data/tfidf_hw02.csv.gz

2 MB doesn't sound like a lot, but that's because the file is compressed and the vocabulary is only 35K terms.
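
To compare the file on disk with the array in memory, here is a rough back-of-the-envelope sketch. It assumes 64-bit floats and a hypothetical document count (`n_docs = 100` is made up; only the 35K vocabulary size comes from the text above). For the real array, `X.nbytes` reports the same quantity directly.

```python
import numpy as np

n_docs = 100        # hypothetical number of documents, for illustration only
n_terms = 35_000    # vocabulary size mentioned above

# A dense float64 array stores 8 bytes per entry, zeros included.
bytes_per_value = np.dtype(np.float64).itemsize
dense_bytes = n_docs * n_terms * bytes_per_value
print(f"{dense_bytes / 1e6:.1f} MB uncompressed")  # 28.0 MB uncompressed
```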

(back to slides)

## SVD

SVD in Sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
svd

**Question:** what sklearn method do you think we can use to *train* the model? Here, train means learn the decomposed matrices.

<details>
<summary>Solution</summary>
    .fit()
</details>

In [None]:
# skip

In [None]:
svd.fit(X)

### Reduce Dimensions

**Question:** what sklearn method do you think we can use to perform dimensionality reduction on X?

<details>
<summary>Solution</summary>
    .transform()
</details>

In [None]:
# skip

In [None]:
U = svd.transform(X)
U.shape

In [None]:
U
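
To see what `.transform()` does, here is a minimal sketch on random toy data (`X_toy` and `svd_toy` are hypothetical names, not part of the demo). It checks that the reduced matrix is the original matrix projected onto the learned components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))  # hypothetical stand-in: 6 documents, 4 terms

svd_toy = TruncatedSVD(n_components=2, n_iter=7, random_state=42)
svd_toy.fit(X_toy)                # learn the components
U_toy = svd_toy.transform(X_toy)  # project the rows onto them

print(U_toy.shape)  # (6, 2)

# transform() multiplies X by the transposed components matrix.
same = np.allclose(U_toy, X_toy @ svd_toy.components_.T)
print(same)
```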

#### Sparsity

In [None]:
values, counts = np.unique(U, return_counts=True)
counts, values

No more 0's: the reduced matrix is dense.

#### Size

In [None]:
U.shape

In [None]:
pd.DataFrame(U).to_csv("data/reduced_tfidf.csv")

In [None]:
!ls -lah data/reduced_tfidf.csv

**Question:** Is this file much smaller than the compressed tf-idf version?

### Singular Values

Remember $s_{1} \geq s_{2} \geq \ldots \geq s_{n}$

In [None]:
S = svd.singular_values_
S
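
The ordering can be checked in code. The sketch below fits a small model on random toy data (all names are hypothetical) and confirms that `singular_values_` comes back in non-increasing order. It also prints `explained_variance_ratio_`, which gives a rough sense of how much variance each component captures.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_toy = rng.random((8, 5))  # hypothetical stand-in for the tf-idf matrix

svd_toy = TruncatedSVD(n_components=3, n_iter=7, random_state=42).fit(X_toy)
S_toy = svd_toy.singular_values_

# Each singular value should be no larger than the one before it.
is_sorted = bool(np.all(np.diff(S_toy) <= 0))
print(S_toy, is_sorted)

# Share of the total variance captured by each component.
print(svd_toy.explained_variance_ratio_)
```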

### Components

***V***-matrix (stored by sklearn as `components_`): ndarray of shape (n_components, n_features)

In [None]:
V = svd.components_
V

In [None]:
V.shape

For the first component, let's figure out the features that have the highest values

In [None]:
V[0].argsort()

In [None]:
# argsort is ascending, so reverse it to get the five largest values
V[0].argsort()[::-1][:5]

In [None]:
# Feature names with the five highest values in the first component
df.columns[V[0].argsort()[::-1][:5]]

Now let's find the features that are most indicative of each component

In [None]:
for k, row in enumerate(V):
    # Reverse the ascending argsort to get the five highest-weighted features
    print(f"Component {k}\t", df.columns[row.argsort()[::-1][:5]])
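
A caution when picking top features: `argsort` sorts in ascending order, so its first indices point at the smallest values and the order has to be reversed to get the largest. The sketch below shows this on a made-up vocabulary and a made-up component row (both hypothetical).

```python
import numpy as np

vocab = np.array(["army", "film", "music", "senate", "war"])  # hypothetical terms
row = np.array([0.10, 0.70, 0.65, -0.05, 0.02])               # hypothetical weights

lowest = row.argsort()[:2]         # indices of the two smallest weights
top = row.argsort()[::-1][:2]      # indices of the two largest weights

print(vocab[lowest])  # ['senate' 'war']
print(vocab[top])     # ['film' 'music']
```

Components can also carry large negative weights; sorting on `np.abs(row)` instead would rank features by magnitude regardless of sign.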

## Plotting documents

In [None]:
svd = TruncatedSVD(n_components=2, n_iter=7, random_state=42)
U = svd.fit_transform(df)

In [None]:
new_df = pd.DataFrame(U)
new_df

In [None]:
new_df.index = df.index
new_df

In [None]:
new_df.plot.scatter(x=1, y=0)

### Train Kmeans model

**Question:** What function do we think we can use to train the model?

<details>
<summary>Hint</summary>
    What function did we use yesterday to train the Naive Bayes and Logistic Regression classifiers?
</details>

<details>
<summary>Solution</summary>
    .fit()
</details>

In [None]:
from sklearn.cluster import KMeans

kmeans_model = km = KMeans(n_clusters=5)
kmeans_model.fit(X)

In [None]:
kmeans_model.labels_
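
To get a feel for what `labels_` contains, here is a small sketch on two synthetic, well-separated blobs (the data and names are hypothetical). It fixes `random_state` so the run is repeatable and counts how many points land in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))  # 20 points near (0, 0)
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))  # 20 points near (5, 5)
points = np.vstack([blob_a, blob_b])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# labels_ holds one cluster id per row; bincount tallies the cluster sizes.
sizes = np.bincount(toy_km.labels_)
print(sizes)
```

The cluster ids themselves are arbitrary: which blob is called 0 and which is called 1 can differ between runs without a fixed seed.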

In [None]:
new_df['cluster'] = kmeans_model.labels_
new_df

In [None]:
new_df.plot.scatter(x=1, y=0, c='cluster')

In [None]:
new_df.plot.scatter(x=1, y=0, c='cluster', cmap='winter')

#### ColorMaps in MatplotLib

https://matplotlib.org/stable/tutorials/colors/colormaps.html

In [None]:
new_df.plot.scatter(x=1, y=0, c='cluster', cmap='tab20')

In [None]:
new_df.plot.scatter(x=1, y=0, c='cluster', cmap='winter')

## More dimensionality reduction techniques in sklearn

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition

The textbook *(text analysis in python for social scientists)* discusses more dimensionality reduction methods (e.g., NMF, Nonnegative Matrix Factorization, and t-SNE).
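
As a rough illustration of NMF, the sketch below factorizes a small random non-negative matrix with sklearn's `NMF` class. The data and the parameter choices (`init="nndsvda"`, `max_iter=500`) are assumptions for this toy case, not settings taken from the textbook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))  # hypothetical non-negative data, like tf-idf scores

nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X_toy)  # document-by-component weights
H = nmf.components_           # component-by-term weights

print(W.shape, H.shape)  # (6, 2) (2, 4)

# Unlike SVD, both factors contain only non-negative values.
print((W >= 0).all(), (H >= 0).all())
```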

(back to slides)

### TSNE

In [None]:
from sklearn.manifold import TSNE
tsne_transformed = TSNE(n_components=2).fit_transform(X)
tsne_transformed.shape
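
t-SNE can be slow when the number of features is very high. The sklearn documentation recommends reducing the dimensionality first, for example with TruncatedSVD for sparse data. The sketch below tries that idea on random toy data; the sizes, `n_components=10`, and `perplexity=10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_toy = rng.random((60, 100))  # hypothetical stand-in: 60 documents, 100 terms

# Step 1: compress to a handful of dimensions with SVD.
reduced = TruncatedSVD(n_components=10, random_state=42).fit_transform(X_toy)

# Step 2: run t-SNE on the smaller matrix (perplexity must be below n_samples).
embedded = TSNE(n_components=2, perplexity=10, init="random",
                random_state=0).fit_transform(reduced)
print(embedded.shape)  # (60, 2)
```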

In [None]:
tsne_df = pd.DataFrame(tsne_transformed)
tsne_df['cluster'] = kmeans_model.labels_

In [None]:
tsne_df.plot.scatter(x=1, y=0, c='cluster', cmap='winter')