# <h3 align="center">__Module 1 Activity__</h3>
# <h3 align="center">__Assigned at the start of Module 1__</h3>
# <h3 align="center">__Due at the end of Module 1__</h3><br>

# Weekly Discussion Forum Participation

Each week, you are required to participate in the module’s discussion forum. The forum centers on the week's Module Activity, which is released at the beginning of the module. You must complete (or at least attempt) the activity before you post about it or about anything related to the topic.

## Grading of the Discussion

### 1. Initial Post:
Create your thread by **Day 5 (Saturday night at midnight, PST).**

### 2. Responses:
Respond to at least two other posts by **Day 7 (Monday night at midnight, PST).**

---

## Grading Criteria:

Your participation will be graded as follows:

### Full Credit (100 points):
- Submit your initial post by **Day 5.**
- Respond to at least two other posts by **Day 7.**

### Half Credit (50 points):
- If your initial post is late but you respond to two other posts.
- If your initial post is on time but you fail to respond to at least two other posts.

### No Credit (0 points):
- If both your initial post and responses are late.
- If you fail to submit an initial post and do not respond to any others.

---

## Additional Notes:

- **Late Initial Posts:** Late posts will automatically receive half credit if two responses are completed on time.
- **Substance Matters:** Responses must be thoughtful and constructive. Comments like “Great post!” or “I agree!” without further explanation will not earn credit.
- **Balance Participation:** Aim to engage with threads that have fewer or no responses to ensure a balanced discussion.

---

## Avoid:
- Posting several times within a very short time frame, especially immediately before the posting deadline.
- Posts that compliment another post and then merely summarize it.


# Data Modeling Activity

## Objective
Familiarize yourself with key data modeling concepts (classification, regression, and clustering) by analyzing a dataset and discussing how different modeling approaches apply.

---

## Instructions

### 1. Find a Dataset
Use the `kagglehub` library to download a dataset from Kaggle. Ensure that you have set up the library with your Kaggle API key. An example of how to do this is provided below:

```python
# Install kagglehub if not already installed
# !pip install kagglehub

import os

import kagglehub as kh
import pandas as pd

# Example: Downloading a dataset
# Replace 'dataset-owner/dataset-name' with the Kaggle dataset identifier
# dataset_download returns the local directory that holds the downloaded files
dataset_path = kh.dataset_download('dataset-owner/dataset-name')

# Load the dataset (replace 'your_dataset.csv' with a CSV file from the download)
data = pd.read_csv(os.path.join(dataset_path, "your_dataset.csv"))

# Display the first few rows of the dataset
print(data.head())

# Basic dataset summary
data.info()
print(data.describe())
```

---

### 2. Dataset Exploration
Answer the following questions about your dataset:
1. What types of features are present (numerical, categorical, etc.)?
2. Are there any missing values or data quality issues?
3. What potential problems can this dataset solve? Identify tasks for classification, regression, and clustering.
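
The first two questions can usually be answered with a few pandas calls. The following is a minimal sketch on a small made-up table; the `toy` DataFrame and its column names are hypothetical stand-ins for your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your dataset
toy = pd.DataFrame({
    "price": [10.5, 11.0, np.nan, 12.25],
    "volume": [100, 150, 120, 130],
    "sector": ["tech", "tech", None, "energy"],
})

# Question 1: which columns are numerical and which are not?
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Question 2: how many missing values per column, and how many duplicate rows?
missing_counts = toy.isna().sum()
n_duplicates = int(toy.duplicated().sum())

print("Numerical columns:", numeric_cols)
print("Other columns:", categorical_cols)
print("Missing values per column:")
print(missing_counts)
print("Duplicate rows:", n_duplicates)
```

Running the same calls on your own `data` DataFrame should give you most of what you need for questions 1 and 2.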

Add your answers in the markdown cell below.

---

### 3. Tasks

#### **Classification**
- Define a classification problem in the dataset (e.g., predicting if a customer will churn).
- Optionally, implement a simple logistic regression or decision tree model.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Replace 'target' with your target column name
X = data.drop(columns=["target"])
y = data["target"]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
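
The example above covers logistic regression only. If you would rather try the decision tree option, a minimal sketch might look like the following. It uses synthetic data from `make_classification` rather than a real dataset, and the variable names (`X_syn`, `tree`, and so on) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for your own features and target)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)

# Stratify so both classes keep the same proportions in train and test sets
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# max_depth caps the size of the tree, which helps limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_syn_train, y_syn_train)

tree_accuracy = accuracy_score(y_syn_test, tree.predict(X_syn_test))
print("Decision tree accuracy:", tree_accuracy)
```

A shallow tree is also easy to inspect, which can help when you describe the model in your post.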

#### **Regression**
- Define a regression problem in the dataset (e.g., predicting house prices).
- Optionally, implement a simple linear regression model.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Replace 'target' with your target column name
X = data.drop(columns=["target"])
y = data["target"]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared Score:", r2_score(y_test, y_pred))
```

#### **Clustering**
- Define how clustering might reveal patterns in your dataset.
- Optionally, implement a simple k-means clustering algorithm.

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Use only numerical features for clustering
X = data.select_dtypes(include=['float64', 'int64'])

# Train a k-means model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Add cluster labels to the dataset
data["Cluster"] = kmeans.labels_

# Visualize the clusters (for two features, if applicable)
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=data["Cluster"], cmap="viridis")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Clustering Visualization")
plt.show()
```
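
k-means groups points by Euclidean distance, so a feature measured on a much larger scale can dominate the result. One common option is to standardize the features first. The sketch below assumes synthetic data with one price-like and one volume-like column; the values and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(42)
features = np.column_stack([
    rng.normal(100, 5, 200),    # price-like column
    rng.normal(3e7, 5e6, 200),  # volume-like column
])

# Rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(features)

# Cluster the scaled features instead of the raw ones
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = km.fit_predict(scaled)

print("Cluster sizes:", np.bincount(cluster_labels))
```

Whether scaling is appropriate depends on your data, so treat it as one choice to justify in your write-up rather than a required step.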

---

### 4. Reflection and Discussion
- How does the choice of supervised or unsupervised learning depend on the dataset and problem?
- What challenges did you anticipate when applying each modeling technique?

Add your reflections below.

---

## Deliverables
1. A short description of your dataset and the problems you identified.
2. Code and results for any implemented models (classification, regression, clustering).
3. A brief reflection on the activity.


# Import Data from Kagglehub

```python
import kagglehub as kh
import pandas as pd
import os

# Download the dataset
dataset_path = kh.dataset_download("samanfatima7/2020-2025-google-stock-dataset")
print(f"Dataset downloaded to: {dataset_path}")

# The dataset contains a CSV file called 'google_5yr_one.csv'
csv_file_path = os.path.join(dataset_path, "google_5yr_one.csv")
print(f"Loading CSV file from: {csv_file_path}")

# Load the dataset
data = pd.read_csv(csv_file_path)

print("First 5 rows of the dataset:")
print(data.head())

print("\nDataset description:")
print(data.describe())

# Remove the first row
data = data.drop(0)

print("*"*50)
print(data.head())
```

```text
Dataset downloaded to: /Users/briancolclough/.cache/kagglehub/datasets/samanfatima7/2020-2025-google-stock-dataset/versions/1
Loading CSV file from: /Users/briancolclough/.cache/kagglehub/datasets/samanfatima7/2020-2025-google-stock-dataset/versions/1/google_5yr_one.csv
First 5 rows of the dataset:
         Date              Close               High                Low  \
0         NaN              GOOGL              GOOGL              GOOGL   
1  2020-06-04   70.3785171508789  71.72309429138843  69.96599205492319   
2  2020-06-05  71.65840148925781   71.9709103787135   70.0461071028752   
3  2020-06-08  72.05748748779297  72.10525562528537  70.88509140875318   
4  2020-06-09  72.25852966308594  73.04079279119881  71.77484210279437   

                Open    Volume  
0              GOOGL     GOOGL  
1   71.4971694316438  26982000  
2  70.44520002096422  42642000  
3    70.974667107052  33878000  
4  71.91816171630913  33624000  

Dataset description:
              Date               Cl
```

# 2. Dataset Exploration
Answer the following questions about your dataset:
1. What types of features are present (numerical, categorical, etc.)?
2. Are there any missing values or data quality issues?
3. What potential problems can this dataset solve? Identify tasks for classification, regression, and clustering.

**1. Feature Types:**
- **Numerical features**: Open, Close, High, and Low prices (continuous financial data), plus trading Volume
- **Date feature**: a Date column records the trading day on which each row's values were measured

**2. Data Quality Issues:**
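- The first row repeats the ticker symbol ("GOOGL") in every price and volume column instead of numerical data, and its Date is missing, so it carries no information and must be dropped
- Because of that row, pandas reads the price and volume columns as text, so they must be converted to numeric types before modeling
- No apparent missing values in the numerical columns after cleaning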
- Data appears to be clean and well-structured for time series analysis

**3. Potential Machine Learning Problems:**

**Classification Tasks:**
- **Direction Prediction**: Predict whether the stock will go up or down the next day

**Regression Tasks:**
- **Price Prediction**: Predict future closing prices based on historical open, high, low prices and volume
- **Next Day Open Price**: Predict tomorrow's opening price using today's trading data

**Clustering Tasks:**
- **Trading Pattern Discovery**: Group days with similar trading patterns (price movements and volume)
- **Market Regime Identification**: Identify different market conditions (bull market, bear market, sideways) through clustering
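
Most of the tasks listed above need targets or features derived from the raw columns. The sketch below shows one way to build them with `shift` and `pct_change`. The five-row `prices` table holds made-up values shaped like the cleaned stock data, and the new column names (`Next_Open`, `Up_Next_Day`, `Daily_Return`, `Range_Pct`) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical five-day sample shaped like the cleaned stock table
prices = pd.DataFrame({
    "Open":  [71.5, 70.4, 71.0, 71.9, 72.5],
    "High":  [71.7, 72.0, 72.1, 73.0, 73.4],
    "Low":   [70.0, 70.0, 70.9, 71.8, 72.0],
    "Close": [70.4, 71.7, 71.2, 72.3, 73.0],
})

# Regression target: tomorrow's opening price
prices["Next_Open"] = prices["Open"].shift(-1)
prices["Next_Close"] = prices["Close"].shift(-1)

# Clustering features: daily return and intraday range relative to the close
prices["Daily_Return"] = prices["Close"].pct_change()
prices["Range_Pct"] = (prices["High"] - prices["Low"]) / prices["Close"]

# The last day has no "tomorrow", so drop it before building the class label
labeled = prices.dropna(subset=["Next_Open", "Next_Close"]).copy()

# Classification target: 1 if the next day's close is higher, else 0
labeled["Up_Next_Day"] = (labeled["Next_Close"] > labeled["Close"]).astype(int)

print(labeled[["Close", "Next_Close", "Up_Next_Day"]])
```

Dropping the final row before creating the label matters: a comparison against a missing value evaluates to `False`, which would otherwise give that row a label of 0 with no data behind it.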



# **Classification**
- Define a classification problem in the dataset (e.g., predicting if a customer will churn).
- Optionally, implement a simple logistic regression or decision tree model.

Here we predict whether the stock will open higher on the next trading day than it closed today.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np

# Convert to numeric first
numeric_columns = ['Open', 'High', 'Low', 'Close', 'Volume']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Create target variable (tomorrow's open > today's close)
data["Tomorrow_Open_Higher_Than_Today_Close"] = (data["Open"].shift(-1) > data["Close"]).astype(int)

# Drop rows with NaN values.
# Note: the last row has no next-day open, but comparing against NaN evaluates
# to False, so that row is labeled 0 here rather than being dropped.
data = data.dropna()

print("Class distribution:")
print(data["Tomorrow_Open_Higher_Than_Today_Close"].value_counts())

# Prepare features and target
X = data.drop(columns=["Tomorrow_Open_Higher_Than_Today_Close", "Date"])
y = data["Tomorrow_Open_Higher_Than_Today_Close"]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with balanced class weights
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```

```text
Class distribution:
Tomorrow_Open_Higher_Than_Today_Close
1    661
0    594
Name: count, dtype: int64
Accuracy: 0.5378486055776892
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       116
           1       0.54      1.00      0.70       135

    accuracy                           0.54       251
   macro avg       0.27      0.50      0.35       251
weighted avg       0.29      0.54      0.38       251
```
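
In the report above, the model assigns every test row to class 1 (recall of 1.00 for class 1 and 0.00 for class 0), so the accuracy simply matches the share of class 1 in the test set. Two things that may contribute are features on very different scales (Volume is in the tens of millions while prices are near 70) and a random split applied to time-ordered data. The sketch below shows one possible variation: standardizing the features in a pipeline, splitting chronologically with `shuffle=False`, and dropping the last row whose label is undefined. It runs on synthetic random-walk prices, not the real dataset, and all names (`demo`, `clf`, `Target`) are hypothetical; it is not guaranteed to improve results on the actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stock data (hypothetical values)
rng = np.random.default_rng(0)
n_days = 300
close = 100 + np.cumsum(rng.normal(0, 1, n_days))
demo = pd.DataFrame({
    "Open": close + rng.normal(0, 0.5, n_days),
    "Close": close,
    "Volume": rng.integers(20_000_000, 40_000_000, n_days),
})

# Label: does tomorrow's open exceed today's close?
demo["Target"] = (demo["Open"].shift(-1) > demo["Close"]).astype(int)

# The last day has no "tomorrow", so its label is undefined: drop it explicitly
demo = demo.iloc[:-1]

X_demo = demo[["Open", "Close", "Volume"]]
y_demo = demo["Target"]

# shuffle=False keeps time order: train on earlier days, test on later days
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Standardize features so Volume does not dwarf the price columns
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, clf.predict(X_te))
print("Accuracy on later days:", demo_accuracy)
```

Because the synthetic prices are a random walk, accuracy near 50% is expected here; the point is the structure of the workflow, not the score.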



