<a href="https://colab.research.google.com/github/NUELBUNDI/Machine-Learning-Projects/blob/main/ML_Logistic_Regression_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import plotly.express as px


In [None]:
# read the file

url ="https://raw.githubusercontent.com/NUELBUNDI/Machine-Learning-Projects/main/weatherAUS.csv"

df=pd.read_csv(url)
df.shape

In [None]:
# Visualize the first 5 rows

df.head(5)

In [None]:
print(df.columns, end=",")

In [None]:
non_numerical_data=[]
numerical_data=[]
for col in df:
  results= (df[col].dtype)
  if results == 'float64' or results == 'int64':
    numerical_data.append(col)
  else:
    non_numerical_data.append(col)


print(f'The Numerical Data  :{numerical_data}\n')
print()
print(f'The Non-Numerical Data :{non_numerical_data}\n')




# Alternative using select_dtypes:
# numerical_data = df.select_dtypes(include=np.number).columns.tolist()
# non_numerical_data = df.select_dtypes('object').columns.tolist()

In [None]:
df.nunique()

In [None]:
# Drop rows where RainToday or RainTomorrow is missing

df.dropna(subset=['RainTomorrow', 'RainToday'], inplace=True)

## Exploratory Data Analysis and Visualization

In [None]:
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size']=14
matplotlib.rcParams['figure.figsize']=(10,6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
px.histogram(df, x='Location',
              title='Location Vs Rainy Day',
              color='RainToday')

In [None]:
px.histogram(df, 
             x='Temp3pm', 
             title='Temperature at 3 pm vs. Rain Tomorrow', 
             color='RainTomorrow')

In [None]:
px.scatter(df.head(2000),
           title='Min Temp Vs Max Temp',
           x='MinTemp',
           y='MaxTemp',
           color='RainToday')

In [None]:
fig, ax = plt.subplots(figsize=(16, 12))
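# Correlations are only defined for numeric columns, so select those first
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, ax=ax)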
plt.show()

## Training, Validation and Test Sets


In [None]:
# Split the data into training, validation and test sets

from sklearn.model_selection import train_test_split

train_val_df , test_df = train_test_split(df, test_size=0.2 , random_state=42)


train_df , val_df = train_test_split(train_val_df, test_size=0.25 , random_state=42)



print(f'df :{df.shape}')

print(f'train_df :{train_df.shape}')

print(f'test_df :{test_df.shape}')

print(f'val_df :{val_df.shape}')

When working with dates, it's often a better idea to separate the training, validation and test sets by time, so that the model is trained on data from the past and evaluated on data from the future.

For the current dataset, we can use the Date column in the dataset to create another column for year. We'll pick the last two years for the test set, and one year before it for the validation set.

In [None]:
plt.title('No. of Rows per Year')
sns.countplot(x=pd.to_datetime(df.Date).dt.year)

In [None]:
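# Drop the helper 'year' column if an earlier run added it to df;
# errors='ignore' avoids a KeyError when the column doesn't exist.
df.drop(columns=['year'], inplace=True, errors='ignore')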

In [None]:
# df['year']= pd.to_datetime(df.Date).dt.year

year= pd.to_datetime(df.Date).dt.year

train_df=df[year<2015]
val_df =df[year==2015]
test_df=df[year>2015]

print(f'train_df : {train_df.shape}')
print(f'val_df : {val_df.shape}')
print(f'test_df : {test_df.shape}')
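
A quick way to confirm that a time-based split behaves as intended is to check that every row lands in exactly one subset and that no training row is newer than a validation row. The following is a minimal sketch on a small made-up dataframe; the `toy` data and the `split_by_year` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the weather dataframe (values are made up).
toy = pd.DataFrame({
    'Date': ['2013-05-01', '2014-07-15', '2015-01-20',
             '2015-11-02', '2016-03-09', '2017-06-30'],
    'Rainfall': [0.0, 1.2, 0.4, 0.0, 5.6, 0.2],
})

def split_by_year(frame, val_year):
    """Split rows into past (train), val_year (validation) and future (test)."""
    year = pd.to_datetime(frame['Date']).dt.year
    return frame[year < val_year], frame[year == val_year], frame[year > val_year]

toy_train, toy_val, toy_test = split_by_year(toy, 2015)

# Every row lands in exactly one subset.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
# No training row is newer than the oldest validation row.
assert pd.to_datetime(toy_train['Date']).max() < pd.to_datetime(toy_val['Date']).min()
print(len(toy_train), len(toy_val), len(toy_test))
```

The same two assertions could be run against train_df, val_df and test_df from the cell above.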

## Identifying Input and Target Columns


Often, not all the columns in a dataset are useful for training a model. In the current dataset, we can ignore the Date column, since we only want to use weather conditions to make a prediction about whether it will rain the next day.

Let's create a list of input columns, and also identify the target column.

In [None]:
input_cols= list(train_df.columns)[1:-1]

target_col ='RainTomorrow'

In [None]:
print(input_cols)
print(target_col)

In [None]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()

val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()


test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()

In [None]:
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

In [None]:
train_inputs[numeric_cols].describe()

Do the ranges of the numeric columns seem reasonable? If not, we may have to do some data cleaning as well.
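
One way to make this check concrete is to count values that fall outside a plausible range for each column. The sketch below uses a made-up sample, and the bounds in `plausible_bounds` are illustrative assumptions rather than limits taken from the dataset; `count_out_of_range` is a hypothetical helper.

```python
import pandas as pd

# Made-up sample; the bounds below are illustrative assumptions.
sample = pd.DataFrame({
    'Humidity9am': [89.0, 101.0, 45.0, None],
    'Cloud9am': [8.0, 3.0, 9.0, 1.0],
})
plausible_bounds = {'Humidity9am': (0, 100), 'Cloud9am': (0, 8)}

def count_out_of_range(frame, bounds):
    """Return, per column, how many non-missing values fall outside [low, high]."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = frame[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts

print(count_out_of_range(sample, plausible_bounds))
```

Missing values are skipped here on purpose, since they are handled separately by imputation.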



In [None]:
# Check the number of unique values in each categorical column

train_inputs[categorical_cols].nunique()

## Imputing Missing Numeric Data

Machine learning models can't work with missing numerical data. The process of filling missing values is called **imputation**.

In [None]:
# Check no of missing values

df[numeric_cols].isna().sum()

train_inputs[numeric_cols].isna().sum()

val_inputs[numeric_cols].isna().sum()
test_inputs[numeric_cols].isna().sum()

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the SimpleImputer class from sklearn.impute.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(strategy='mean')

The first step in imputation is to fit the imputer to the data i.e. compute the chosen statistic (e.g. mean) for each column in the dataset.

In [None]:
imputer.fit(df[numeric_cols])

After calling fit, the computed statistic for each column is stored in the statistics_ property of imputer.



In [None]:
list(imputer.statistics_)

The missing values in the training, test and validation sets can now be filled in using the transform method of imputer.

In [None]:
train_inputs[numeric_cols]= imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols]= imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols]= imputer.transform(test_inputs[numeric_cols])

The missing values are now filled in with the mean of each column.



In [None]:
train_inputs[numeric_cols].isna().sum()

Learn more about other imputation techniques here: https://scikit-learn.org/stable/modules/impute.html
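
As one example of another technique, SimpleImputer also accepts strategy='median', which is less sensitive to outliers than the mean. The following is a minimal sketch on made-up numbers, not on the weather data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with an outlier (100.0) and one missing value; numbers are made up.
values = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_imputer = SimpleImputer(strategy='mean').fit(values)
median_imputer = SimpleImputer(strategy='median').fit(values)

# statistics_ holds the fill value learned for each column.
print(mean_imputer.statistics_)
print(median_imputer.statistics_)

# The missing entry is replaced by the median of the observed values.
filled = median_imputer.transform(values)
print(filled.ravel())
```

The outlier pulls the mean far above the typical values, while the median stays close to them.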

## Scaling Numeric Features

Another good practice is to scale numeric features to a small range of values e.g. $(0, 1)$ or $(-1, 1)$. Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.

The numeric columns in our dataset have varying ranges.

Let's use MinMaxScaler from sklearn.preprocessing to scale values to the $(0, 1)$ range.
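
Min-max scaling maps each value $x$ in a column to $(x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum are taken from the data the scaler was fitted on. The sketch below compares the manual formula with MinMaxScaler on a made-up column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up temperatures in a single column.
temps = np.array([[10.0], [15.0], [20.0], [30.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min, col_max = temps.min(axis=0), temps.max(axis=0)
manual = (temps - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(temps)

assert np.allclose(manual, scaled)
print(scaled.ravel())
```

Values that lie outside the range seen during fitting are mapped outside $(0, 1)$ by default.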

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

# First, fit the scaler to the data, i.e. compute the range of values for each numeric column

scaler.fit(df[numeric_cols])

In [None]:
# Inspect the minimum and maximum values in each column

print(f'Minimum: {list(scaler.data_min_)}')

print(f'Maximum: {list(scaler.data_max_)}')

In [None]:
# We can now separately scale the training, validation and test sets using the transform method of scaler.


train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

In [None]:
# Verify that values in each column lie in range (0,1)

train_inputs[numeric_cols].describe()

## Encoding Categorical Data

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.
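
One-hot encoding creates one column per category and places a 1 in the column matching each row's category. The following is a minimal sketch on made-up wind directions. It also shows what handle_unknown='ignore' does with a category that was not seen during fitting.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up wind directions; 'E' is deliberately absent from the fitting data.
seen = [['N'], ['S'], ['W'], ['S']]
toy_encoder = OneHotEncoder(handle_unknown='ignore').fit(seen)

# The default output is a SciPy sparse matrix, so convert it for display.
encoded = toy_encoder.transform([['S'], ['E']]).toarray()

print(toy_encoder.categories_[0])  # categories learned during fit
print(encoded)                     # the unseen 'E' becomes a row of zeros
```

Calling .toarray() on the default sparse output keeps the sketch independent of the argument that controls dense output, whose name differs between scikit-learn versions.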



In [None]:
df[categorical_cols].nunique()

We can perform one hot encoding using the OneHotEncoder class from sklearn.preprocessing.



In [None]:
from sklearn.preprocessing import OneHotEncoder


In [None]:
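# sparse_output replaces the older sparse argument (scikit-learn 1.2+); use sparse=False on earlier versions
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')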


First, we fit the encoder to the data i.e. identify the full list of categories across all categorical columns.



In [None]:
encoder.fit(df[categorical_cols])

In [None]:
encoder.categories_

The encoder has created a list of categories for each of the categorical columns in the dataset.



In [None]:
# We can generate column names for each individual category using get_feature_names_out.

encoded_cols= list(encoder.get_feature_names_out(categorical_cols))
print(encoded_cols)

All of the above columns will be added to train_inputs, val_inputs and test_inputs.



To perform the encoding, we use the transform method of encoder.

In [None]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
test_inputs

## Saving Processed Data to Disk

It can be useful to save processed data to disk, especially for really large datasets, to avoid repeating the preprocessing steps every time you start the Jupyter notebook. The parquet format is a fast and efficient format for saving and loading Pandas dataframes.

In [None]:
print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)

In [None]:
train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')

In [None]:
%%time
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')
pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')

We can read the data back using pd.read_parquet.



In [None]:
%%time

train_inputs = pd.read_parquet('train_inputs.parquet')
val_inputs = pd.read_parquet('val_inputs.parquet')
test_inputs = pd.read_parquet('test_inputs.parquet')

train_targets = pd.read_parquet('train_targets.parquet')[target_col]
val_targets = pd.read_parquet('val_targets.parquet')[target_col]
test_targets = pd.read_parquet('test_targets.parquet')[target_col]


Let's verify that the data was loaded properly.



In [None]:
print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)

In [None]:
val_inputs

## Training a Logistic Regression Model

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model:

1. we take a linear combination (or weighted sum) of the input features

2. we apply the sigmoid function to the result to obtain a number between 0 and 1

3. this number represents the probability of the input being classified as "Yes"

4. instead of RMSE, the cross entropy loss function is used to evaluate the results
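
The steps above can be sketched in a few lines of NumPy. This is an illustration of the idea with made-up inputs, weights and labels, not scikit-learn's implementation; `sigmoid` and `cross_entropy` are helper names chosen for this example.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p):
    """Mean binary cross entropy; y_true holds 0/1 labels, p holds predicted probabilities."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up inputs, weights and labels (1 stands for "Yes", 0 for "No").
X = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
w = np.array([-1.0, 2.0])
b = 0.0
y = np.array([1, 0, 1])

z = X @ w + b               # weighted sum of the input features
p = sigmoid(z)              # probability of "Yes" for each row
loss = cross_entropy(y, p)  # lower is better

print(p)
print(loss)
```

Training a logistic regression model amounts to searching for the weights and bias that make this loss as small as possible on the training data.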

In [None]:
from sklearn.linear_model import LogisticRegression


In [None]:
# ?LogisticRegression

In [None]:
model = LogisticRegression(solver='liblinear')


In [None]:
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)


In [None]:
print(numeric_cols + encoded_cols)

In [None]:
print(model.coef_.tolist())

In [None]:
print(model.intercept_)

## Making Predictions and Evaluating the Model


In [None]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

In [None]:
train_preds= model.predict(X_train)

train_preds

In [None]:
train_targets

We can output a probabilistic prediction using predict_proba.



In [None]:
train_probs=model.predict_proba(X_train)
train_probs

The numbers above indicate the probabilities for the target classes "No" and "Yes".
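
The relationship between predict and predict_proba can be checked on a small synthetic problem. The sketch below uses made-up data and a throwaway model (`toy_model`). It assumes nothing about the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: the label tends to be "Yes" when the single feature is large.
rng = np.random.default_rng(0)
feature = rng.uniform(0, 1, size=(200, 1))
noise = rng.normal(0, 0.1, size=200)
labels = np.where(feature[:, 0] + noise > 0.5, 'Yes', 'No')

toy_model = LogisticRegression(solver='liblinear').fit(feature, labels)
probs = toy_model.predict_proba(feature)

# Columns of predict_proba follow the order of classes_.
print(toy_model.classes_)

# Each row sums to 1, and predict picks the class with the larger probability.
assert np.allclose(probs.sum(axis=1), 1.0)
assert (toy_model.classes_[probs.argmax(axis=1)] == toy_model.predict(feature)).all()
```

Checking classes_ before reading the probability columns avoids mixing up which column belongs to which class.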



We can test the accuracy of the model's predictions by computing the percentage of matching values in train_preds and train_targets.
This can be done using the accuracy_score function from sklearn.metrics.
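
Accuracy is simply the fraction of positions where the prediction matches the target. A minimal sketch with made-up labels, comparing a manual computation with accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up targets and predictions.
toy_targets = np.array(['No', 'No', 'Yes', 'Yes', 'No'])
toy_preds = np.array(['No', 'Yes', 'Yes', 'No', 'No'])

# Fraction of matching positions (3 of 5 here).
manual_accuracy = (toy_targets == toy_preds).mean()

assert np.isclose(manual_accuracy, accuracy_score(toy_targets, toy_preds))
print(manual_accuracy)
```

Note that accuracy_score returns a fraction between 0 and 1, so it needs to be multiplied by 100 to read as a percentage.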

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(train_targets, train_preds)

The model achieves an accuracy of 85.1% on the training set. We can visualize the breakdown of correctly and incorrectly classified inputs using a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(train_targets, train_preds, normalize='true')
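
To see how the normalized matrix is laid out, here is a small sketch with made-up labels. With normalize='true', each row corresponds to a true class and sums to 1, so the diagonal shows the fraction of each class that was classified correctly.

```python
from sklearn.metrics import confusion_matrix

# Made-up targets and predictions: 4 dry days and 2 rainy days.
toy_targets = ['No', 'No', 'No', 'No', 'Yes', 'Yes']
toy_preds = ['No', 'No', 'No', 'Yes', 'Yes', 'No']

# Rows are true classes, columns are predicted classes, both ordered as in labels.
cm = confusion_matrix(toy_targets, toy_preds, labels=['No', 'Yes'], normalize='true')
print(cm)
```

In this toy case, three of the four "No" days and one of the two "Yes" days are predicted correctly.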

Let's define a helper function to generate predictions, compute the accuracy score and plot a confusion matrix for a given set of inputs.

In [None]:
def predict_and_plot(inputs, targets, name=''):
    # Generate predictions and report accuracy as a percentage
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print(f'Accuracy: {accuracy * 100:.2f}%')

    # Plot the confusion matrix, normalized over the true (row) labels
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title(f'{name} Confusion Matrix')

    return preds

In [None]:
train_preds = predict_and_plot(X_train, train_targets, 'Training')


In [None]:
val_preds = predict_and_plot(X_val, val_targets, 'Validation')


In [None]:
test_preds = predict_and_plot(X_test, test_targets, 'Test')


The accuracy of the model on the test and validation sets is above 84%, which suggests that our model generalizes well to data it hasn't seen before.

But how good is 84% accuracy? While this depends on the nature of the problem and on business requirements, a good way to verify whether a model has actually learned something useful is to compare its results to a "random" or "dumb" model.

Let's create two models: one that guesses randomly and another that always returns "No". Both of these models completely ignore the inputs given to them.

In [None]:
def random_guess(inputs):
    return np.random.choice(["No", "Yes"], len(inputs))

def all_no(inputs):
  return np.full(len(inputs), "No")


In [None]:
accuracy_score(test_targets, random_guess(X_test))

In [None]:
accuracy_score(test_targets, all_no(X_test))

Our random model achieves an accuracy of 50% and our "always No" model achieves an accuracy of 77%.

Thankfully, our model is better than a "dumb" or "random" model! This is not always the case, so it's a good practice to benchmark any model you train against such baseline models.
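
Scikit-learn also ships a ready-made baseline, DummyClassifier, which can stand in for a hand-written "always No" model. The following is a minimal sketch on made-up targets, offered as an alternative to the functions above rather than what the notebook uses.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up targets: 7 dry days and 3 rainy days. The baseline ignores the features.
toy_targets = np.array(['No'] * 7 + ['Yes'] * 3)
toy_inputs = np.zeros((len(toy_targets), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(toy_inputs, toy_targets)

# score() returns accuracy, which here equals the share of the majority class.
baseline_accuracy = baseline.score(toy_inputs, toy_targets)
print(baseline_accuracy)
```

On an imbalanced dataset this baseline can look deceptively strong, which is why a model's accuracy should be read relative to it.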

## Making Predictions on a Single Input

Once the model has been trained to a satisfactory accuracy, it can be used to make predictions on new data. Consider the following dictionary containing data collected from the Katherine weather department today:

In [None]:
new_input = {'Date': '2021-06-19',
             'Location': 'Katherine',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

The first step is to convert the dictionary into a Pandas dataframe, similar to df. This can be done by passing a list containing the given dictionary to the pd.DataFrame constructor.

In [None]:

new_input_df = pd.DataFrame([new_input])

We must now apply the same transformations applied while training the model:

1. Imputation of missing values using the imputer created earlier

2. Scaling numerical features using the scaler created earlier

3. Encoding categorical features using the encoder created earlier

In [None]:
new_input_df[numeric_cols] = imputer.transform(new_input_df[numeric_cols])
new_input_df[numeric_cols] = scaler.transform(new_input_df[numeric_cols])
new_input_df[encoded_cols] = encoder.transform(new_input_df[categorical_cols])

In [None]:
X_new_input = new_input_df[numeric_cols + encoded_cols]
X_new_input

We can now make a prediction using model.predict.

In [None]:
prediction = model.predict(X_new_input)[0]

In [None]:
prediction

In [None]:
prob = model.predict_proba(X_new_input)[0]
prob

Let's define a helper function to make predictions for individual inputs.



In [None]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    pred = model.predict(X_input)[0]
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
    return pred, prob

In [None]:
new_input = {'Date': '2021-06-19',
             'Location': 'Launceston',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

In [None]:
predict_input(new_input)
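
The helper above applies the imputer, scaler and encoder by hand. As an alternative, scikit-learn's Pipeline and ColumnTransformer can bundle the same steps with the model into a single object. The sketch below uses a tiny made-up dataframe and hypothetical names (`toy_df`, `toy_pipeline`). It illustrates the pattern and is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up training rows with one numeric and one categorical column.
toy_df = pd.DataFrame({
    'Humidity3pm': [80.0, 20.0, np.nan, 90.0, 30.0, 25.0],
    'RainToday': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
    'RainTomorrow': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
})
toy_numeric, toy_categorical = ['Humidity3pm'], ['RainToday']

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values, then scale to the (0, 1) range.
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', MinMaxScaler())]), toy_numeric),
    # Categorical columns: one-hot encode, ignoring unseen categories.
    ('cat', OneHotEncoder(handle_unknown='ignore'), toy_categorical),
])

toy_pipeline = Pipeline([('preprocess', preprocess),
                         ('model', LogisticRegression(solver='liblinear'))])
toy_pipeline.fit(toy_df[toy_numeric + toy_categorical], toy_df['RainTomorrow'])

# A single new input goes through the same steps in one call.
single = pd.DataFrame([{'Humidity3pm': np.nan, 'RainToday': 'Yes'}])
single_pred = toy_pipeline.predict(single)[0]
single_prob = toy_pipeline.predict_proba(single)[0]
print(single_pred, single_prob)
```

One design consequence is that the preprocessing statistics are learned only from the rows passed to fit, so the fitted pipeline can be reused on new inputs without tracking each transformer separately.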

## Saving and Loading Trained Models

We can save the parameters (weights and biases) of our trained model to disk, so that we needn't retrain the model from scratch each time we wish to use it. Along with the model, it's also important to save imputers, scalers, encoders and even column names. Anything that will be required while generating predictions using the model should be saved.

We can use the joblib module to save and load Python objects on the disk.

In [None]:
import joblib

Let's first create a dictionary containing all the required objects.



In [None]:
rain2morrow = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

We can now save this to a file using joblib.dump, and load it back using joblib.load.

In [None]:
joblib.dump(rain2morrow, 'rain2morrow.joblib')

In [None]:
rain2morrow = joblib.load('rain2morrow.joblib')

Let's use the loaded model to make predictions on the original test set.

In [None]:
test_preds2 = rain2morrow['model'].predict(X_test)
accuracy_score(test_targets, test_preds2)
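
A round trip through joblib can be checked in isolation as well. The sketch below trains a throwaway model on made-up data, saves a small bundle to a temporary directory and confirms that the restored model reproduces the original predictions. The names `toy_model` and `bundle` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data for a throwaway model.
toy_X = np.array([[0.1], [0.2], [0.8], [0.9]])
toy_y = np.array(['No', 'No', 'Yes', 'Yes'])
toy_model = LogisticRegression(solver='liblinear').fit(toy_X, toy_y)

# Bundle the model with any metadata needed at prediction time.
bundle = {'model': toy_model, 'numeric_cols': ['Humidity3pm']}

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_bundle.joblib')
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
assert (restored['model'].predict(toy_X) == toy_model.predict(toy_X)).all()
assert restored['numeric_cols'] == bundle['numeric_cols']
```

Files saved this way are generally best loaded with the same scikit-learn version that produced them.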

# Putting it all Together

While we've covered a lot of ground in this tutorial, the number of lines of code for processing the data and training the model is fairly small. Each step requires no more than 3-4 lines of code.

## Data Preprocessing

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Download the dataset

url ="https://raw.githubusercontent.com/NUELBUNDI/Machine-Learning-Projects/main/weatherAUS.csv"
df=pd.read_csv(url)
raw_df=df
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)


# Create training, validation and test sets
year = pd.to_datetime(raw_df.Date).dt.year
train_df, val_df, test_df = raw_df[year < 2015], raw_df[year == 2015], raw_df[year > 2015]

# Create inputs and targets
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'
train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()

# Identify numeric and categorical columns
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

# Impute missing numerical values
imputer = SimpleImputer(strategy = 'mean').fit(raw_df[numeric_cols])
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

# Scale numeric features
scaler = MinMaxScaler().fit(raw_df[numeric_cols])
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# One-hot encode categorical features
# sparse_output replaces the older sparse argument (scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(raw_df[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

# Save processed data to disk
train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')
pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')

# Load processed data from disk
train_inputs = pd.read_parquet('train_inputs.parquet')
val_inputs = pd.read_parquet('val_inputs.parquet')
test_inputs = pd.read_parquet('test_inputs.parquet')
train_targets = pd.read_parquet('train_targets.parquet')[target_col]
val_targets = pd.read_parquet('val_targets.parquet')[target_col]
test_targets = pd.read_parquet('test_targets.parquet')[target_col]

## Model Training and Evaluation

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Select the columns to be used for training/prediction
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

# Create and train the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, train_targets)

# Generate predictions and probabilities
train_preds = model.predict(X_train)
train_probs = model.predict_proba(X_train)
accuracy_score(train_targets, train_preds)

# Helper function to predict, compute accuracy & plot confusion matrix
def predict_and_plot(inputs, targets, name=''):
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name));    
    return preds

# Evaluate on validation and test set
val_preds = predict_and_plot(X_val, val_targets, 'Validation')
test_preds = predict_and_plot(X_test, test_targets, 'Test')

# Save the trained model & load it back
aussie_rain = {'model': model, 'imputer': imputer, 'scaler': scaler, 'encoder': encoder,
               'input_cols': input_cols, 'target_col': target_col, 'numeric_cols': numeric_cols,
               'categorical_cols': categorical_cols, 'encoded_cols': encoded_cols}
joblib.dump(aussie_rain, 'aussie_rain.joblib')
aussie_rain2 = joblib.load('aussie_rain.joblib')

## Prediction on Single Inputs

In [None]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    pred = model.predict(X_input)[0]
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
    return pred, prob

new_input = {'Date': '2021-06-19',
             'Location': 'Launceston',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

predict_input(new_input)

To train a logistic regression model, we can use the LogisticRegression class from Scikit-learn. 
We covered the following topics in this tutorial:

Loading a real-world dataset from a CSV file hosted on GitHub

Exploratory data analysis and visualization

Splitting a dataset into training, validation & test sets

Filling/imputing missing values in numeric columns

Scaling numeric features to a $(0, 1)$ range

Encoding categorical columns as one-hot vectors

Training a logistic regression model using Scikit-learn

Evaluating a model using a validation set and test set

Saving a model to disk and loading it back

# Resources

Check out the following resources to learn more:

1. https://www.youtube.com/watch?v=-la3q9d7AKQ&list=PLNeKWBMsAzboR8vvhnlanxCNr2V7ITuxy&index=1

2. https://www.kaggle.com/prashant111/extensive-analysis-eda-fe-modelling

3. https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction#Baseline

4. https://jovian.ai/aakashns/03-logistic-regression

## Practise

Try training logistic regression models on the following datasets:

1. [Breast Cancer Detection](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)

2. [Loan Repayment Prediction](https://www.kaggle.com/competitions/home-credit-default-risk/data)

