# Modeling
## Data Preparation:

1. **Handling Missing Values:**
   - Check the dataset for any missing values and decide on an appropriate strategy for handling them. This could involve imputation (e.g., filling missing values with the mean, median, or mode) or removal of rows or columns with missing values, depending on the data and the extent of missingness.

2. **Scaling Numerical Features:**
   - Scale numerical features so that they share a similar range. This can improve the performance of algorithms that are sensitive to feature magnitudes (e.g., models trained with gradient descent). Tree-based models such as random forests are largely insensitive to feature scale.
   - Common scaling techniques include standardization (scaling to have zero mean and unit variance) or normalization (scaling to a range, typically [0, 1]).

3. **Splitting the Dataset:**
   - Split the dataset into training and testing sets to train and evaluate the machine learning models.
   - A common split is 80% for training and 20% for testing, but the ratio can be adjusted based on the size of the dataset and specific requirements.
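
The imputation and scaling steps described above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on a made-up toy frame, not the Netflix file; the median strategy and the column values are assumptions chosen for the example. The pipeline is fitted on the training rows only, so that statistics from the test rows do not leak into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric frame with one missing value (illustrative data, not the Netflix file)
toy = pd.DataFrame({
    "Open": [5.0, 5.2, np.nan, 5.1, 5.4],
    "Volume": [3.1e7, 6.7e7, 4.9e7, 5.6e7, 3.5e7],
})

# Keep the first four rows for training and the last row for testing
train, test = toy.iloc[:4], toy.iloc[4:]

# Median imputation followed by standardization (zero mean, unit variance)
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Fit on the training rows only, then reuse the fitted statistics on the test rows
train_scaled = prep.fit_transform(train)
test_scaled = prep.transform(test)

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
```

Swapping `StandardScaler` for `MinMaxScaler` would give the [0, 1] normalization mentioned above instead of standardization.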

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Load the historical data
netflix_data = pd.read_csv("cleaned_netflix_data.csv")

# Check for missing values
missing_values = netflix_data.isnull().sum()
print("Missing Values:\n", missing_values)

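# Separate features and target variable ('Date' is dropped because it is not numeric)
X = netflix_data.drop(columns=['Date', 'Adj Close'])
y = netflix_data['Adj Close']

# Split the dataset into training and testing sets
# Note: train_test_split shuffles rows by default, so this is a random split, not a chronological one
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)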

# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Missing Values:
 Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64
Shape of X_train: (4028, 5)
Shape of X_test: (1007, 5)
Shape of y_train: (4028,)
Shape of y_test: (1007,)
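
Because the rows are ordered by date, a random split places some training rows after some test rows in time. One possible alternative, sketched here on made-up data, is to pass `shuffle=False` to `train_test_split` so that the most recent 20% of rows form the test set. The toy frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy daily series standing in for the Netflix data (values are made up)
dates = pd.date_range("2004-01-26", periods=10, freq="D")
toy = pd.DataFrame({"Date": dates, "Close": np.linspace(5.0, 6.0, 10)})

X = toy[["Close"]]
y = toy["Close"]

# shuffle=False keeps row order, so the last 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(toy.loc[X_train.index, "Date"].max())  # last training date
print(toy.loc[X_test.index, "Date"].min())   # first test date
```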


## Feature Selection:
Identify Relevant Features:

**Historical Financial Data:** Features such as 'Open', 'High', 'Low', 'Close', and 'Volume' describe each day's price range and trading activity, and can provide insights into market trends.

**Calendar Features:** Incorporate calendar-related features such as day of the week, month, or quarter, which may influence trading behavior and market dynamics.

**Create Lag Features:** Generate lagged versions of the existing features to capture temporal dependencies and trends in the data. For example, lagged values of 'Close' prices from previous days could be informative for predicting future prices.

**Rolling Window Statistics:** Compute rolling window statistics (moving averages, standard deviations) over different time periods to capture short-term and long-term trends in the data.

In [2]:
import pandas as pd

# Load the historical data
netflix_data = pd.read_csv("cleaned_netflix_data.csv")

# Feature Engineering
# Create Lag Features (shift counts rows, i.e. trading days, not calendar days)
netflix_data['Close_Lag1'] = netflix_data['Close'].shift(1)  # Close price from the previous trading day
netflix_data['Close_Lag7'] = netflix_data['Close'].shift(7)  # Close price from 7 trading days earlier

# Rolling Window Statistics (window=30 means the last 30 rows, including the current one)
netflix_data['Rolling_Mean_Close'] = netflix_data['Close'].rolling(window=30).mean()  # 30-row moving average of Close prices
netflix_data['Rolling_Std_Close'] = netflix_data['Close'].rolling(window=30).std()    # 30-row rolling standard deviation of Close prices

# Calendar Features (parse the 'Date' strings once and reuse the result)
dates = pd.to_datetime(netflix_data['Date'])
netflix_data['Day_of_Week'] = dates.dt.dayofweek  # Day of the week (0 = Monday, 6 = Sunday)
netflix_data['Month'] = dates.dt.month            # Month of the year (1 to 12)

# Display the updated dataframe with new features
print(netflix_data.head())

         Date      Open      High       Low     Close  Adj Close    Volume  \
0  2004-01-26  5.464286  5.505714  5.375000  5.447143   5.447143  31306800   
1  2004-01-27  5.428571  5.681429  5.350714  5.370714   5.370714  67012400   
2  2004-01-28  5.365714  5.455714  5.053571  5.135714   5.135714  49611800   
3  2004-01-29  5.209286  5.216429  4.906429  5.053571   5.053571  56393400   
4  2004-01-30  5.033571  5.333571  5.013571  5.243571   5.243571  35644000   

   Close_Lag1  Close_Lag7  Rolling_Mean_Close  Rolling_Std_Close  Day_of_Week  \
0         NaN         NaN                 NaN                NaN            0   
1    5.447143         NaN                 NaN                NaN            1   
2    5.370714         NaN                 NaN                NaN            2   
3    5.135714         NaN                 NaN                NaN            3   
4    5.053571         NaN                 NaN                NaN            4   

   Month  
0      1  
1      1  
2      1  
3      1  
4      1  
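
As the output shows, the lag and rolling features are NaN for the first rows, where not enough history exists yet. One way to handle this before modeling, sketched here on a made-up toy frame with the same column names, is to drop those warm-up rows. The example also adds the quarter feature mentioned above; the `Quarter` column name is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the notebook (prices are made up)
dates = pd.bdate_range("2004-01-26", periods=40)
toy = pd.DataFrame({"Date": dates, "Close": np.linspace(5.0, 9.0, 40)})

# Same lag and rolling definitions as in the cell above
toy["Close_Lag1"] = toy["Close"].shift(1)
toy["Close_Lag7"] = toy["Close"].shift(7)
toy["Rolling_Mean_Close"] = toy["Close"].rolling(window=30).mean()
toy["Rolling_Std_Close"] = toy["Close"].rolling(window=30).std()
toy["Quarter"] = toy["Date"].dt.quarter  # quarter of the year (1 to 4)

# The first 29 rows lack a full 30-row window, so they contain NaN
features = toy.dropna().reset_index(drop=True)
print(len(toy), "->", len(features))
```

Dropping rows is only one option; a shorter window or `min_periods` in `rolling` would keep more rows at the cost of noisier early statistics.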


## Model: Random Forest

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Adjust these based on your actual data and features
X = netflix_data[['Open', 'High', 'Low', 'Close', 'Volume']]
y = netflix_data['Adj Close']

# Train-Validation Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters (fixed values here; no tuning is performed in this cell)
n_estimators = 100
max_depth = 10
min_samples_split = 2
min_samples_leaf = 1

# Define Random Forest model
rf_model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                 min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                 random_state=42)

# Model Training
rf_model.fit(X_train, y_train)

# Feature Importance
feature_importance = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_model.feature_importances_})
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
print("Feature Importance:\n", feature_importance)

# Model Evaluation on Validation Set
y_val_pred_rf = rf_model.predict(X_val)
mse_rf = mean_squared_error(y_val, y_val_pred_rf)

print("\nValidation Set Mean Squared Error (Random Forest):", mse_rf)


Feature Importance:
   Feature  Importance
3   Close    0.936211
1    High    0.023610
2     Low    0.023396
0    Open    0.016782
4  Volume    0.000001

Validation Set Mean Squared Error (Random Forest): 0.08711301187256565
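
MSE is expressed in squared price units, which makes it hard to read directly. Taking the square root gives the RMSE, which is in the same units as 'Adj Close'. The short calculation below uses the MSE value printed above. Note also that in the rows printed earlier 'Close' and 'Adj Close' are identical, so a low error from a model that sees 'Close' is not surprising.

```python
import math

# Validation MSE reported by the cell above
mse_rf = 0.08711301187256565

# RMSE is the square root of MSE and is in the target's units (price)
rmse_rf = math.sqrt(mse_rf)
print(f"Validation RMSE: {rmse_rf:.3f}")  # prints "Validation RMSE: 0.295"
```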


### Hyperparameter Tuning

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the data (replace 'netflix.csv' with the path to your actual file)
netflix_data = pd.read_csv('netflix.csv')

# Adjust these based on your actual data and features
X = netflix_data[['Open', 'High', 'Low', 'Close', 'Volume']]
y = netflix_data['Adj Close']

# Train-Validation Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Perform Grid Search Cross-Validation for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found by Grid Search
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model from Grid Search
best_rf_model = grid_search.best_estimator_

# Model Evaluation on Validation Set
y_val_pred_rf = best_rf_model.predict(X_val)
mse_rf = mean_squared_error(y_val, y_val_pred_rf)

print("\nValidation Set Mean Squared Error (Random Forest):", mse_rf)


Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 150}

Validation Set Mean Squared Error (Random Forest): 0.0754588208173375
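
With `cv=5`, `GridSearchCV` uses ordinary K-fold splits, which do not respect time order. For time-ordered data, one alternative is scikit-learn's `TimeSeriesSplit`, where each fold trains on earlier rows and validates on the rows that follow. The sketch below uses synthetic data and a deliberately small grid; both are assumptions made to keep the example fast and self-contained, not the settings used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data standing in for the Netflix features
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=120)

# Each fold trains on earlier rows and validates on the rows that follow
tscv = TimeSeriesSplit(n_splits=5)

# Deliberately small grid to keep the example fast
param_grid = {"n_estimators": [20, 50], "min_samples_leaf": [1, 2]}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=tscv,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

print("Best Hyperparameters:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)
```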


## Saving Model

In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import joblib

# Load your dataset
netflix_data = pd.read_csv('netflix.csv')

# Adjust these based on your actual data and features
X = netflix_data[['Open', 'High', 'Low', 'Close', 'Volume']]
y = netflix_data['Adj Close']

# Retrain the Random Forest on the full dataset with the best hyperparameters found by Grid Search
rf_model = RandomForestRegressor(n_estimators=150, max_depth=None, min_samples_leaf=2, min_samples_split=2, random_state=42)
rf_model.fit(X, y)

# Save the trained model using joblib
joblib.dump(rf_model, 'random_forest_model.pkl')

print("Model saved successfully!")

Model saved successfully!
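
A saved model can be restored with `joblib.load` and used for prediction. The round trip below is a minimal sketch: it trains a small model on synthetic data and writes to a temporary directory so that it runs without 'netflix.csv'. The data, the temporary path, and the reduced `n_estimators` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model so the example runs without 'netflix.csv'
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 3] + rng.normal(scale=0.01, size=50)
rf_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Save to a temporary directory, then load the model back
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_model.pkl")
joblib.dump(rf_model, model_path)
loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same = np.allclose(rf_model.predict(X), loaded_model.predict(X))
print("Predictions match:", same)
```

Files written by joblib are generally best loaded with the same scikit-learn version that produced them, and only from trusted sources.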
