## Home Task
### Topic Modeling

In [26]:
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

fn = "data/voted-kaggle-dataset.csv"
df = pd.read_csv(fn)

In [27]:
print("Length of texts = {:,}".format(len(df)))
index = 10 
df.loc[index, "Description"]

Length of texts = 2,150


'These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. k'

In [28]:
df.head()

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...


Preprocess the text data (here, the `Description` column), replacing missing descriptions with empty strings so the vectorizer receives only strings

In [29]:
documents = df['Description'].fillna('') # Fill missing values with empty string

Create a document-term matrix using CountVectorizer

In [31]:
vectorizer = CountVectorizer(max_features=1000, stop_words="english")
dtm = vectorizer.fit_transform(documents)

Apply Latent Dirichlet Allocation (LDA)

In [32]:
num_topics = 5
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(dtm)

Display the top words for each topic

In [33]:
feature_names = vectorizer.get_feature_names_out()

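n_top_words = 10

# Each row of lda.components_ holds one topic's weight for every vocabulary word.
for topic_idx, topic in enumerate(lda.components_):
    print(f"Top words for Topic #{topic_idx + 1}:")
    # argsort() is ascending, so slice backwards from the end to get the largest weights first.
    top_words_idx = topic.argsort()[:-n_top_words - 1:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(top_words)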
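
The negative-step slice applied to `argsort()` above can be hard to read. The sketch below reproduces the same indexing on a plain Python list, where slicing follows the same rules as on a 1-D NumPy array; the weights and the names `weights`, `n_top`, `ascending`, and `top_idx` are made up for illustration.

```python
# Hypothetical word weights for one topic (illustration only).
weights = [0.2, 3.1, 0.7, 5.4, 1.9]
n_top = 3

# Indices ordered by ascending weight, which is what argsort() returns.
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Start at the last element, step backwards, and stop after n_top items.
top_idx = ascending[:-n_top - 1:-1]

print(ascending)  # indices from smallest to largest weight
print(top_idx)    # indices of the n_top largest weights, largest first
```

The slice is equivalent to reversing the sorted indices and keeping the first `n_top`, that is, `ascending[::-1][:n_top]`.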

Top words for Topic #1:
['data', 'csv', 'dataset', 'time', 'player', 'number', 'team', 'com', 'game', 'contains']
Top words for Topic #2:
['data', 'dataset', 'information', 'acknowledgements', 'content', 'city', 'context', 'state', 'contains', 'available']
Top words for Topic #3:
['dataset', 'data', 'content', 'context', 'contains', 'images', 'using', 'acknowledgements', 'used', 'use']
Top words for Topic #4:
['university', 'text', 'user', 'data', 'date', 'id', 'title', 'number', 'post', 'users']
Top words for Topic #5:
['data', 'number', 'year', 'age', 'total', 'years', 'health', 'survey', 'dataset', 'country']


Assign topics to documents

In [37]:
topic_assignments = lda.transform(dtm)
df['Topic'] = topic_assignments.argmax(axis=1)
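
Taking the `argmax` keeps only the dominant topic and discards how strongly each document belongs to it. As a minimal sketch, assuming a hand-written document-topic matrix (the numbers and the names `doc_topic`, `dominant`, and `strength` are hypothetical), the weight of the winning topic can be kept alongside its index. In recent scikit-learn versions each row returned by `lda.transform` sums to 1, which the toy matrix mimics.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 5 topics.
# Each row sums to 1, mimicking the output of lda.transform.
doc_topic = np.array([
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.30, 0.28, 0.02, 0.20, 0.20],
    [0.10, 0.10, 0.10, 0.60, 0.10],
])

dominant = doc_topic.argmax(axis=1)  # 0-based index of the most probable topic
strength = doc_topic.max(axis=1)     # weight of that topic in each document

for doc_id, (topic, weight) in enumerate(zip(dominant, strength)):
    print(f"doc {doc_id}: topic index {topic} (Topic #{topic + 1}), weight {weight:.2f}")
```

The second document illustrates why the weight matters: its top two topics are nearly tied, so the single label says little about it.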

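Display the DataFrame with assigned topics. The `Topic` column holds 0-based indices, so a value of 0 corresponds to "Topic #1" in the top-words listing above.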

In [38]:
print(df[['Description', 'Topic']])

                                            Description  Topic
0     The datasets contains transactions made by cre...      2
1     The ultimate Soccer database for data analysis...      0
2     Background\nWhat can we say about the success ...      0
3     Context\nInformation on more than 170,000 Terr...      1
4     Context\nBitcoin is the longest running and mo...      0
...                                                 ...    ...
2145  Context\nFortnite: Battle Royale has over 20 m...      0
2146  Context\nThis dataset provides the nationaliti...      2
2147  lem.json\nThis file contains lementized englis...      3
2148  Context\nThis data set contains weather data f...      0
2149  Context\nBirths in U.S during 1994 to 2003.\nC...      0

[2150 rows x 2 columns]
