
# 🎬 Midterm Machine Learning Project - YouTube Trending Videos 📊

## 1. Exploratory Data Analysis (EDA)

### Load and Clean Data
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("top-1000-trending-youtube-videos.csv")

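# Clean numerical columns: strip thousands separators, then convert to float.
# Values that cannot be parsed become NaN instead of raising an error.
for col in ['Video views', 'Likes', 'Dislikes']:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '', regex=False), errors='coerce')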

# Drop rows with missing target values
df.dropna(subset=['Video views', 'Likes'], inplace=True)

# Fill missing categories with "Unknown"
df['Category'] = df['Category'].fillna("Unknown")
```

### Basic Stats & Distribution
```python
print(df.describe())  # print explicitly so the summary also shows when run as a script
sns.histplot(df['Video views'], bins=50, log_scale=True)
plt.title("Distribution of Video Views")
plt.show()
```

## 2. Regression - Predicting Video Views
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

X = df[['Likes']].fillna(0)
y = df['Video views']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))
```
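The histogram in the EDA section uses a log scale, which suggests view counts are heavily right-skewed. One optional variant is to fit the same linear model on log-transformed values. The sketch below is only an illustration: it uses synthetic data as a stand-in for the `Likes` and `Video views` columns, and the names `likes`, `views`, `log_model`, and `views_pred` are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed stand-in for 'Likes' and 'Video views'
# (assumption: the real columns would come from the cleaned DataFrame).
rng = np.random.default_rng(42)
likes = rng.lognormal(mean=10, sigma=1.5, size=500)
views = 50 * likes * rng.lognormal(mean=0, sigma=0.3, size=500)

# log1p keeps zero counts finite; expm1 is its inverse.
X_log = np.log1p(likes).reshape(-1, 1)
y_log = np.log1p(views)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_log, y_log, test_size=0.2, random_state=42
)

log_model = LinearRegression().fit(X_tr, y_tr)
r2_log = r2_score(y_te, log_model.predict(X_te))

# Map predictions back to raw view counts for reporting.
views_pred = np.expm1(log_model.predict(X_te))

print("R² on the log scale:", r2_log)
```

Note that an R² computed on the log scale is not directly comparable to one computed on raw view counts, so both should be reported on the same scale when comparing models.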

## 3. Classification - Is the Video a Hit?
```python
# Define a hit as having at least 10 million views (>= 1e7)
df['is_hit'] = (df['Video views'] >= 1e7).astype(int)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_cls = df[['Likes']].fillna(0)
y_cls = df['is_hit']

X_train, X_test, y_train, y_test = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
```
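Because the label comes from a fixed view threshold, the two classes may be far from balanced, and accuracy alone can then look better than it is. A minimal sketch of a sanity check, assuming synthetic data in place of the real columns (the names `is_hit_demo`, `baseline_acc`, and `rf_acc` are hypothetical): compare the classifier against a majority-class baseline on a stratified split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 'Likes' and 'Video views' (assumption, not real data).
rng = np.random.default_rng(0)
likes = rng.lognormal(mean=10, sigma=1.5, size=1000)
views = 50 * likes * rng.lognormal(mean=0, sigma=0.5, size=1000)
is_hit_demo = (views >= 1e7).astype(int)

X_demo = likes.reshape(-1, 1)

# stratify keeps the hit / non-hit ratio the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, is_hit_demo, test_size=0.2, random_state=42, stratify=is_hit_demo
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
rf_acc = accuracy_score(y_te, rf.predict(X_te))

print("Share of hits:", is_hit_demo.mean())
print("Baseline accuracy:", baseline_acc)
print("Random forest accuracy:", rf_acc)
```

If the random forest barely beats the baseline, the per-class precision and recall in `classification_report` are the more informative numbers.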

## 4. Clustering - Grouping Videos by Interaction
```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

features = df[['Video views', 'Likes']].dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Visualize with PCA (with only two input features this rotates the scaled data rather than reducing it)
pca = PCA(n_components=2)
pca_components = pca.fit_transform(X_scaled)

plt.scatter(pca_components[:, 0], pca_components[:, 1], c=clusters, cmap='viridis')
plt.title("Clustering of Videos (PCA)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()
```
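The choice of three clusters above is fixed by hand. One common way to check it is to compare silhouette scores across several values of `k`. The following is a rough sketch on synthetic blobs, not on the project data; `points`, `scores`, and `best_k` are illustrative names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three well-separated 2-D blobs (assumption, not real data).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.vstack(
    [c + rng.normal(scale=0.5, size=(100, 2)) for c in centers]
)
X_demo = StandardScaler().fit_transform(points)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print("Silhouette scores:", scores)
print("Best k:", best_k)
```

On real data the scores are usually much closer together than on clean blobs, so the result is better read as a hint than as a definitive answer.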

## 5. Conclusion
This report covers data analysis, view-count prediction, classification of "hot" videos, and clustering based on interaction. The regression and classification models both gave promising results, while clustering helps to better understand groups of similar videos.

---

**Author:** [Your name]  
**Deadline:** 5/5/2025
