# Machine Learning is all about working with data and drawing meaning out of it.

# Data can come in many formats, including:

CSV (Comma-Separated Values): Plain text file with values separated by commas, used for tabular data.

Excel (XLS/XLSX): Spreadsheet file formats used by Microsoft Excel, for tabular data.

TSV (Tab-Separated Values): Similar to CSV but with values separated by tabs.

Plain Text (TXT): Unformatted text files, used for raw text data.

JSON (JavaScript Object Notation): Lightweight data-interchange format for structured data, often used for APIs and data exchange.

XML (eXtensible Markup Language): Markup language for encoding hierarchical data and documents, commonly used in web services.

HTML (HyperText Markup Language): Markup language for creating web pages.

JPEG/JPG: Common image file formats with lossy compression.

PNG: Image file format with lossless compression.

BMP: Bitmap image file format.

GIF: Bitmap image format that supports animations.

WAV: Audio file format for storing waveforms.

MP3: Audio file format with lossy compression.

FLAC: Audio file format with lossless compression.

AAC: Advanced audio coding format with lossy compression.

MP4: Multimedia container format for storing video and audio.

AVI: Multimedia container format introduced by Microsoft.

MOV: Multimedia container format used by QuickTime.

MKV: Multimedia container format for video files.

SQL Databases: Structured data stored in relational databases like MySQL, PostgreSQL.

NoSQL Databases: Data stored in non-relational databases like MongoDB, Cassandra, often with a flexible schema.

HDF5: File format and tools for managing complex data.

Shapefile: Geospatial vector data format for GIS software.

GeoJSON: Format for encoding geographic data structures.

FASTA: Text-based format for representing nucleotide or peptide sequences.

VCF (Variant Call Format): Format for storing gene sequence variations.


# CSV (Comma-Separated Values): Used for tabular data.
Headers: Column names in the first row.
Rows: One record per line, with values separated by commas.
#name,age,city
#Precious,72,Mzuzu
#James,63,Blantyre
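
As a minimal sketch, the sample above can be parsed with Python's built-in `csv` module. The text is held in a string (`csv_text`, a name chosen here for illustration) so that the cell runs without a file on disk.

```python
import csv
import io

# The same sample as above, held in a string instead of a file
csv_text = "name,age,city\nPrecious,72,Mzuzu\nJames,63,Blantyre\n"

# DictReader uses the header row as the keys for every data row
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    # Every value is read as text, so numbers must be converted explicitly
    print(row["name"], int(row["age"]), row["city"])
```

For real files, `pd.read_csv` (used later in this notebook) does the same job and infers the column types for you.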

# Don't run this cell; the code below is for demonstration purposes only
# JSON (JavaScript Object Notation): Used for structured data.
{
 "name": "Precious",
 "age": 72,
 "city": "Mzuzu"
}
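
To read JSON like this from Python, the standard-library `json` module can be used. The sketch below assumes the JSON text is already in a string called `json_text` (an illustrative name).

```python
import json

json_text = '{"name": "Precious", "age": 72, "city": "Mzuzu"}'

# JSON text -> Python dict (numbers stay numbers, unlike CSV)
person = json.loads(json_text)
print(person["name"], person["age"])

# Python dict -> JSON text again
print(json.dumps(person, indent=1))
```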

In [26]:
# Don't run this cell; the code below is for demonstration purposes only
# XML (eXtensible Markup Language): Used for hierarchical data.
<person>
  <name>Precious</name>
  <age>72</age>
  <city>Mzuzu</city>
</person>

SyntaxError: invalid syntax (699431307.py, line 3)
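
Raw XML is not Python code, which is why running the cell above raises a SyntaxError. A minimal sketch of reading the same document with the standard-library `xml.etree.ElementTree` module follows; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """
<person>
  <name>Precious</name>
  <age>72</age>
  <city>Mzuzu</city>
</person>
""".strip()

# Parse the text into a tree; `root` is the <person> element
root = ET.fromstring(xml_text)

# Turn the child elements into a dict of tag -> text
person = {child.tag: child.text for child in root}
person["age"] = int(person["age"])  # element text is always a string
print(person)
```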

# Data formatting

# 1. String Formatting

i. Old Style Formatting (% Operator): This is an older method, similar to C's printf-style formatting.

In [31]:
name = "Precious"
age = 72
formatted_string = "My name is %s and I am %d years old." % (name, age)
print(formatted_string)

My name is Precious and I am 72 years old.


ii. New Style Formatting (str.format)
Introduced in Python 2.6 and 3.0, this method is more powerful and flexible than the % operator.

In [34]:
name = "Precious"
age = 72
formatted_string = "My name is {} and I am {} years old.".format(name, age)
print(formatted_string)

My name is Precious and I am 72 years old.


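iii. f-Strings (Formatted String Literals)
Introduced in Python 3.6, f-strings embed expressions directly inside the string literal.

In [36]: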
name = "Precious"
age = 72
formatted_string = f"My name is {name} and I am {age} years old."
print(formatted_string)

My name is Precious and I am 72 years old.


# 2. Formatting Numbers
i. Specifying Decimal Places

In [39]:
value = 123.456789
formatted_value = "{:.2f}".format(value)  # format to 2 decimal places
print(formatted_value)  # Output: 123.46

# Using f-strings
formatted_value = f"{value:.2f}"
print(formatted_value)  # Output: 123.46

123.46
123.46


ii. Adding a Thousands Separator

In [41]:
value = 123456789
formatted_value = "{:,}".format(value)
print(formatted_value)  # Output: 123,456,789

# Using f-strings
formatted_value = f"{value:,}"
print(formatted_value)  # Output: 123,456,789

123,456,789
123,456,789


# 3. Aligning Text
Left, Right, and Center Alignment

In [44]:
text = "Hello"
print(f"{text:<10}")  # Left align (pad with spaces)
print(f"{text:>10}")  # Right align (pad with spaces)
print(f"{text:^10}")  # Center align (pad with spaces)

Hello     
     Hello
  Hello   


# 4. Formatting Output in Data Structures
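Using the pprint module:

The pprint ("pretty-print") module is particularly useful when dealing with complex nested data structures such as lists of dictionaries, dictionaries of dictionaries, or deeply nested JSON-like objects.

It makes the output more readable, especially for nested structures. Note that by default pprint sorts dictionary keys alphabetically, which is why the keys in the output below appear in a different order than in the code: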

In [47]:
# Example usage
import pprint

data = [
    {"name": "Precious", "age": 92, "city": "Mzuzu"},
    {"name": "James", "age": 63, "city": "Blantyre"},
    {"name": "Connie", "age": 5, "city": {"name": "Blantyre", "Country": "Malawi"}}
]

pprint.pprint(data)

[{'age': 92, 'city': 'Mzuzu', 'name': 'Precious'},
 {'age': 63, 'city': 'Blantyre', 'name': 'James'},
 {'age': 5,
  'city': {'Country': 'Malawi', 'name': 'Blantyre'},
  'name': 'Connie'}]


The nature of your data and the specific problem you are trying to solve determine whether to use classification, prediction (typically referring to regression), or another type of analysis.
Here are some guidelines to help you decide:

# 1. Classification
Use classification when you want to predict a categorical outcome.

Key Characteristics:

Categorical Target Variable: The output variable is a category or class label.

Finite Set of Possible Outputs: The target variable can take on one of a limited number of discrete values.

Examples:

Spam Detection: Classify emails as "spam" or "not spam".

Image Classification: Classify images into categories like "cat", "dog", "car", etc.

Sentiment Analysis: Classify text as having "positive", "negative", or "neutral" sentiment.

Medical Diagnosis: Classify whether a patient has a certain disease (e.g., "cancer" or "no cancer").
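
A minimal sketch of a classification task, assuming synthetic data from scikit-learn's `make_classification` and a `LogisticRegression` model. Both are illustrative choices (as are the `X_demo`/`y_demo` names), not the model used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data: 200 rows, 4 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Fit a classifier and predict class labels for the first five rows
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
predicted = clf.predict(X_demo[:5])

print(clf.classes_)  # the finite set of labels the model can output
print(predicted)     # every prediction is one of those labels
```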

# 2. Regression (Prediction)
Use regression when you want to predict a continuous outcome.

**Key Characteristics:**

**Continuous Target Variable:** The output variable is continuous and numerical.

**Infinite Set of Possible Outputs:** The target variable can take on any value within a range.

Examples:

House Price Prediction: Predict the price of a house based on its features (e.g., size, location).

Stock Price Prediction: Predict the future price of a stock based on historical data.

Sales Forecasting: Predict future sales figures based on past data.

Temperature Prediction: Predict future temperatures based on historical weather data.
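
A minimal regression sketch, assuming made-up house sizes and prices that lie exactly on a straight line (the numbers are hypothetical and chosen only to keep the arithmetic easy to follow).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (square metres) and price
size = np.array([[50], [80], [100], [120], [150]])
price = 1000 * size.ravel() + 20000  # an exact straight line, for illustration

# Fit a line and predict a continuous value for an unseen size
reg = LinearRegression().fit(size, price)
pred = reg.predict([[110]])
print(pred)
```

Because the toy data follow `price = 1000 * size + 20000` exactly, the prediction for a size of 110 should come out very close to 130000. Real data would be noisy and the fit only approximate.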

# 3. Clustering
Use clustering when you want to group data points into clusters based on similarity.

**Key Characteristics:**

No Predefined Target Variable: You don't have labeled outcomes.

Unsupervised Learning: The algorithm identifies patterns and groups in the data without predefined labels.

Examples:

**Customer Segmentation:** Group customers into segments based on purchasing behavior.

**Anomaly Detection:** Identify unusual patterns that do not fit into any cluster.

**Market Basket Analysis:** Group products frequently bought together (more commonly tackled with association-rule mining, though clustering can also be used).
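
A minimal clustering sketch, assuming six hypothetical 2-D points that form two obvious groups, and using `KMeans` as one possible algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points: three near (1, 1) and three near (9, 9)
points = np.array([[1, 1], [1.5, 2], [1, 1.5],
                   [8, 8], [9, 9], [8.5, 9.5]])

# No labels are supplied; KMeans finds the groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(labels)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which points share a label.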

# 4. Anomaly Detection
Use anomaly detection to identify outliers or abnormal instances in the data.

Key Characteristics:

Rare Events: Focus on identifying rare or unexpected events.

Deviation from Norm: The goal is to detect data points that deviate significantly from the majority of data.

Examples:

Fraud Detection: Detect fraudulent transactions that deviate from normal transaction patterns.

Network Security: Identify unusual network activity that may indicate a security breach.

Quality Control: Detect defective products in a manufacturing process.
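
A minimal anomaly-detection sketch, assuming hypothetical transaction amounts with one extreme value appended, and using `IsolationForest` as one possible detector. The `contamination` value is an assumption about how rare anomalies are.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=100, scale=5, size=(200, 1))  # typical amounts around 100
amounts = np.vstack([normal, [[500.0]]])               # one extreme value at the end

# contamination is the assumed share of anomalies in the data
detector = IsolationForest(contamination=0.01, random_state=0).fit(amounts)
flags = detector.predict(amounts)  # +1 = normal, -1 = anomaly

print(flags[-1])            # flag for the extreme value
print((flags == -1).sum())  # how many rows were flagged in total
```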

# 5. Time Series Analysis
Use time series analysis when your data is sequential and you want to analyze trends over time.

Key Characteristics:

Temporal Data: Data points are collected or recorded at specific time intervals.

Trend Analysis: Focus on identifying patterns, trends, and seasonal variations over time.

Examples:

Weather Forecasting: Predict future weather conditions based on historical data.

Economic Forecasting: Analyze and predict economic indicators like GDP, unemployment rates, etc.

Sales Analysis: Examine sales trends over time to predict future sales.
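
A minimal time-series sketch with pandas, assuming six days of made-up sales figures. A rolling mean smooths the series to expose the trend, and `diff` shows the day-to-day change.

```python
import pandas as pd

# Hypothetical daily sales, indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([10, 12, 14, 16, 18, 20], index=dates)

trend = sales.rolling(window=3).mean()  # 3-day moving average
change = sales.diff()                   # difference from the previous day

print(trend)
print(change)
```

The first two values of `trend` are NaN because a 3-day window is not yet full there.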

# How to Decide?
**1. Identify the Problem Type:**

Classification: If the problem involves predicting a category or class.

Regression: If the problem involves predicting a numerical value.

Clustering: If the goal is to group similar data points together without predefined labels.

Anomaly Detection: If the goal is to identify rare or abnormal instances.

Time Series Analysis: If the data involves time-based sequences.

**2. Examine the Target Variable:**

Categorical: Use classification.

Continuous: Use regression.

No Target Variable: Use clustering or anomaly detection.

**3. Consider the Business Context:**

Understand the business problem and the type of decision you need to support. This can often guide you towards the right type of analysis.

**4. Explore the Data:**

Perform exploratory data analysis (EDA) to understand the characteristics of your data, which can help in determining the appropriate method.
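
The "examine the target variable" rule can be sketched as a small helper. `suggest_task` and its `max_classes` threshold are hypothetical, invented here for illustration. It is a rough heuristic rather than a substitute for understanding the problem: a numeric target with only a few distinct values would be labelled as classification.

```python
import pandas as pd

def suggest_task(target=None, max_classes=20):
    """Suggest an analysis type from the target column (rough heuristic)."""
    if target is None:
        # No target variable at all
        return "clustering or anomaly detection"
    target = pd.Series(target)
    # Text labels, or only a handful of distinct values: treat as categories
    if not pd.api.types.is_numeric_dtype(target) or target.nunique() <= max_classes:
        return "classification"
    # Many distinct numeric values: treat as continuous
    return "regression"

print(suggest_task(["Yes", "No", "No"]))
print(suggest_task([i * 1.5 for i in range(100)]))
print(suggest_task())
```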

# Machine Learning Basics

Don't let the impressive name fool you. Machine learning is more or less the following steps:

1. Get your data and clean it up
1. Identify what parts of your data are **features**
1. Identify what is your **target variable** that you'll guess based on your features
1. Split your data into **training and testing sets**
1. **Train** your model against the training set
1. **Validate** your model against the testing set
1. ????
1. Profit

We are going to use the Python library [scikit-learn](https://scikit-learn.org/stable/) and we are going to be doing a [classification](https://en.wikipedia.org/wiki/Statistical_classification) problem. 

# sklearn is the import name of scikit-learn (short for "SciPy Toolkit")

It is a popular open-source machine learning library for Python. 
It provides simple and efficient tools for data mining and data analysis, making it a valuable resource for building and deploying machine learning models.

**Key Features of scikit-learn:**

Simple and Efficient: Offers simple APIs to implement common machine learning algorithms efficiently.

Wide Range of Algorithms: Includes a variety of machine learning algorithms for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Built on Other Libraries: Built on top of NumPy, SciPy, and Matplotlib, ensuring compatibility and ease of integration with other scientific libraries in Python.

Well-Documented: Comprehensive documentation and numerous tutorials make it accessible to both beginners and experienced practitioners.

Community Support: Large and active community contributing to its continuous development and improvement.


# Commonly Used Modules in scikit-learn:
Classification: Logistic Regression, Support Vector Machines (SVM), Decision Trees, Random Forests, k-Nearest Neighbors (k-NN), and more.

Regression: Linear Regression, Ridge Regression, Lasso, ElasticNet, Support Vector Regression (SVR), and others.

Clustering: K-Means, Agglomerative Clustering, DBSCAN, Mean Shift, etc.

Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-SNE, etc.

Model Selection: Grid Search, Randomized Search, cross-validation techniques.

Preprocessing: Standardization, normalization, encoding categorical variables, handling missing values, feature extraction, and more.

# Example usage:

A classic example for demonstrating scikit-learn on a classification task is the **Iris dataset**. In this notebook we will work through the same steps on a toy dataset.

Splitting the dataset into training and testing sets is a fundamental step in the machine learning workflow. 
It involves dividing your dataset into two separate parts:

Training Set: A subset of the data used to train the machine learning model. 
The model learns from this data, adjusting its parameters to minimize error or maximize accuracy.
Testing Set: A subset of the data used to evaluate the performance of the trained model. 
This data is not used during the training process, allowing for an unbiased assessment of the model's ability to generalize to new, unseen data.

Importance of Splitting the Dataset:

Helps in evaluating how well the model generalizes to new data.

Helps detect overfitting, where the model performs well on training data but poorly on new data.
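
Since the Iris dataset is mentioned above, here is a minimal sketch of the split, train and evaluate workflow on it. The 30% test size and the `random_state` values are arbitrary choices for illustration, and the `iris_` variable names are used so that nothing here clashes with the toy-dataset variables later on.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # 150 flowers, 4 numeric features, 3 classes

# Hold back 30% of the rows for testing
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Train on the training rows only, then score on the unseen test rows
iris_model = DecisionTreeClassifier(random_state=0).fit(X_iris_train, y_iris_train)
iris_accuracy = accuracy_score(y_iris_test, iris_model.predict(X_iris_test))

print(X_iris_train.shape, X_iris_test.shape)
print(iris_accuracy)
```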


# Let's start by loading the Libraries we need

In [2]:
# This should look familiar to those of you who have done machine learning before
import pandas as pd
#import numpy as np

# We'll draw a graph later on
import matplotlib.pyplot as plt

# The scikit-learn pieces we need to split the data, train a model and evaluate it
# DecisionTreeClassifier is one of the classifiers we can use
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text
from sklearn import metrics
from sklearn import tree

print("Ready to proceed!")

Ready to proceed!


## Getting the data ready

Now, let's load our data. 

Our decision tree can only work with numerical values, so we'll have to modify the columns of data that are text based. Preparing the data is usually the most difficult part of the process.

In [3]:
# We will use a toy dataset.
data = pd.read_csv("toy_dataset.csv") #Load data
data.head() # Show first few rows of data

Unnamed: 0,Number,City,Gender,Age,Income,Illness
0,1,Dallas,Male,41,40367.0,No
1,2,Dallas,Male,54,45084.0,No
2,3,Dallas,Male,42,52483.0,No
3,4,Dallas,Male,40,40941.0,No
4,5,Dallas,Male,46,50289.0,No


We observe that our target variable is Illness, i.e. we want to be able to predict whether there is <b>illness or not</b>. Which columns are our features?

# Data Cleaning Process.
Data cleaning, a key component of data preprocessing, involves removing or correcting irrelevant, incomplete, or inaccurate data. 
This process is essential because the quality of the data used in machine learning significantly impacts the performance of the models.

Of course, we might say Number, City, Gender, Age and Income. But Number is just used for counting or identifying an individual, and therefore it is not a good feature to help us predict our target variable. Let's drop this column.

In [77]:
data.drop(columns=['Number'])  # Returns a new DataFrame without the Number column; `data` itself is unchanged

Unnamed: 0,City,Gender,Age,Income,Illness
0,Dallas,Male,41,40367.0,No
1,Dallas,Male,54,45084.0,No
2,Dallas,Male,42,52483.0,No
3,Dallas,Male,40,40941.0,No
4,Dallas,Male,46,50289.0,No
...,...,...,...,...,...
149995,Austin,Male,48,93669.0,No
149996,Austin,Male,25,96748.0,No
149997,Austin,Male,26,111885.0,No
149998,Austin,Male,25,111878.0,No


In [79]:
# Let us see if the column has been removed
data.head()

Unnamed: 0,Number,City,Gender,Age,Income,Illness
0,1,Dallas,Male,41,40367.0,No
1,2,Dallas,Male,54,45084.0,No
2,3,Dallas,Male,42,52483.0,No
3,4,Dallas,Male,40,40941.0,No
4,5,Dallas,Male,46,50289.0,No


In [81]:
# Clearly, it is still there, because .drop() returned a modified copy and left `data` untouched.
# Use .drop() when you want a modified copy without changing the original DataFrame or Series.
# Use .drop(inplace=True) when you want to modify the original DataFrame or Series directly.

data.drop(columns=['Number'], inplace=True)

#Do we still have it?
data.head()

Unnamed: 0,City,Gender,Age,Income,Illness
0,Dallas,Male,41,40367.0,No
1,Dallas,Male,54,45084.0,No
2,Dallas,Male,42,52483.0,No
3,Dallas,Male,40,40941.0,No
4,Dallas,Male,46,50289.0,No


In [4]:
#How many cities do we have? Using unique we show all cities available
print(data['City'].unique())
print(data['City'].nunique())

['Dallas' 'New York City' 'Los Angeles' 'Mountain View' 'Boston'
 'Washington D.C.' 'San Diego' 'Austin']
8


Now, we just need to represent it all as numbers instead of text labels because our models expect numbers only. So that means we need to change the columns:


- `Illness` - instead of a No / Yes label we'll use 0 and 1
- `City` - this will break out the column into 8 different columns, one per city
- `Gender` - this will break out the column into 2 different columns, one per gender




In [86]:
# Instead of No/Yes we'll use 0 and 1
# Encoding "Yes" as 1 makes our analysis later on less ambiguous
# Assigning the result back avoids relying on inplace=True on a selected column;
# note that .map() turns any value other than "No"/"Yes" into NaN
data["Illness"] = data["Illness"].map({"No": 0, "Yes": 1})

# We change categorical values into numeric ones using `get_dummies`
data = pd.get_dummies(data, columns=['City', 'Gender'], dtype=int)  # dtype=int gives 0/1 instead of True/False
data.head(5)

Unnamed: 0,Age,Income,Illness,City_Austin,City_Boston,City_Dallas,City_Los Angeles,City_Mountain View,City_New York City,City_San Diego,City_Washington D.C.,Gender_Female,Gender_Male
0,41,40367.0,0,0,0,1,0,0,0,0,0,0,1
1,54,45084.0,0,0,0,1,0,0,0,0,0,0,1
2,42,52483.0,0,0,0,1,0,0,0,0,0,0,1
3,40,40941.0,0,0,0,1,0,0,0,0,0,0,1
4,46,50289.0,0,0,0,1,0,0,0,0,0,0,1


In [88]:
#This example shows the last 10 entries in the dataframe
data.tail(10)

Unnamed: 0,Age,Income,Illness,City_Austin,City_Boston,City_Dallas,City_Los Angeles,City_Mountain View,City_New York City,City_San Diego,City_Washington D.C.,Gender_Female,Gender_Male
149990,26,82163.0,0,1,0,0,0,0,0,0,0,1,0
149991,51,97510.0,0,1,0,0,0,0,0,0,0,0,1
149992,37,88408.0,0,1,0,0,0,0,0,0,0,0,1
149993,64,89906.0,0,1,0,0,0,0,0,0,0,0,1
149994,37,106097.0,0,1,0,0,0,0,0,0,0,1,0
149995,48,93669.0,0,1,0,0,0,0,0,0,0,0,1
149996,25,96748.0,0,1,0,0,0,0,0,0,0,0,1
149997,26,111885.0,0,1,0,0,0,0,0,0,0,0,1
149998,25,111878.0,0,1,0,0,0,0,0,0,0,0,1
149999,37,87251.0,0,1,0,0,0,0,0,0,0,1,0


## Now we are done with the most difficult part of the process: understanding the data and getting it ready.

## Building and Running the Model

We now have our data cleaned up and represented in a way that scikit-learn will be able to analyze.

We now need to split our columns into two types:
- **features** represent the data we use to build our guess
- **target variable** the thing our model hopes to guess

In [5]:
data.columns

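Index(['Age', 'Income', 'Illness', 'City_Austin', 'City_Boston', 'City_Dallas',
       'City_Los Angeles', 'City_Mountain View', 'City_New York City',
       'City_San Diego', 'City_Washington D.C.', 'Gender_Female',
       'Gender_Male'],
      dtype='object')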

In [6]:
#all of the following columns are features, we'll make a list of their names
features = ['Age', 'Income', 'City_Austin', 'City_Boston', 'City_Dallas',
       'City_Los Angeles', 'City_Mountain View', 'City_New York City',
       'City_San Diego', 'City_Washington D.C.', 'Gender_Female',
       'Gender_Male']

X = data[features]

# Our target variable is the Illness column
y = data.Illness


In [7]:
X.shape, y.shape



## Training and testing

Before we build our model we need to get the data ready for it. We do this by breaking it into two different pieces. The diagram shows a conceptualization of how this is proportioned.

![Train Test Split](https://raw.githubusercontent.com/BrockDSL/Machine_Learning_with_Python/master/train_test.png)

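- **Training set** - This is what is used to build the model. If the model simply _memorizes_ the training data instead of learning general patterns, it will do well on this set but poorly on new data. This is called _overfitting_ the data. We also need to be careful when setting the size: the larger the training set, the less data is left over for testing.
- **Testing set** - This is held back and used to see if our guesses are correct on data the model has not seen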

Before, we were looking at the **columns** of the data; this investigation of training/testing looks at the **rows** of data.



In [8]:
#Training and test together make up 100% of the data!
#We start with a baseline of 30% of our data as testing

test_percent = 30
train_percent = 100 - test_percent

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=test_percent/100.0,  random_state=23)


Now the interesting part: we build our model, **train** it against the **training set** and see how it **predicts** against the **testing set**.

In [105]:
# Create Decision Tree classifier object
treeClass = DecisionTreeClassifier()

# Train
treeClass = treeClass.fit(X_train,y_train)

#Predict
y_pred = treeClass.predict(X_test)

## Accuracy of the Model

To see how good our machine learning model is we need to see how accurate our predictions are. scikit-learn has built-in functions and [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) to do this for us.

In [116]:
print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

Accuracy: 
0.8681111111111111


## Using Different Models

In [119]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

In [130]:
# Create Random Forest classifier object
clsfier = RandomForestClassifier()

# Train
clsfier = clsfier.fit(X_train,y_train)

#Predict
y_pred = clsfier.predict(X_test)

print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

Accuracy: 
0.8688222222222223


In [122]:
# Create AdaBoost classifier object
clsfier = AdaBoostClassifier()

# Train
clsfier = clsfier.fit(X_train,y_train)

#Predict
y_pred = clsfier.predict(X_test)

print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

Accuracy: 
0.9191777777777778


Using default parameter values, AdaBoostClassifier gives the highest accuracy.

How about fine-tuning the same decision tree classifier? Could we get a better result?

There are many parameters that can be tuned in a decision tree classifier. See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

As an example we will just tune <b>max_depth</b>, the maximum depth of the tree.

In [138]:
for d in [1, 3, 5,10,15,20, 50, 100, None]:
    # Create Decision Tree classifier object
    clsfier = DecisionTreeClassifier(max_depth=d, random_state=23)
    
    # Train
    clsfier = clsfier.fit(X_train,y_train)
    
    #Predict
    y_pred = clsfier.predict(X_test)
    acc = metrics.accuracy_score(y_test,y_pred)
    
    print('depth', d, "Accuracy: ", acc)

depth 1 Accuracy:  0.9191777777777778
depth 3 Accuracy:  0.9191777777777778
depth 5 Accuracy:  0.9191777777777778
depth 10 Accuracy:  0.9187555555555555
depth 15 Accuracy:  0.9183555555555556
depth 20 Accuracy:  0.9168222222222222
depth 50 Accuracy:  0.877
depth 100 Accuracy:  0.8466888888888889
depth None Accuracy:  0.8468888888888889


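With this dataset, accuracy stays flat for shallow trees and then drops as the tree gets deeper (beyond a depth of about 5), which is typical of a deep tree overfitting the training set. Note that this is not universal behaviour; it may be different with other datasets.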
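
The identical accuracy (0.9191777777777778) reported above for AdaBoost and for trees of depth 1, 3 and 5 is what one would see if those models predicted the same class for nearly every row. That is only a possibility, not something the outputs above prove, so it is worth comparing against a majority-class baseline. The sketch below uses synthetic data with an assumed positive rate of roughly 8% (the `X_fake`/`y_fake` names and that rate are illustrative, not taken from the toy dataset).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(23)
X_fake = rng.normal(size=(1000, 3))             # features unrelated to the label
y_fake = (rng.random(1000) < 0.08).astype(int)  # roughly 8% positives (assumed rate)

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_fake, y_fake)
y_base = baseline.predict(X_fake)

print(accuracy_score(y_fake, y_base))    # high accuracy without learning anything
print(confusion_matrix(y_fake, y_base))  # but every positive case is missed
```

On the real data, the same check would be to fit the baseline on `X_train, y_train` and inspect `metrics.confusion_matrix(y_test, y_pred)` for each model.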

## Saving and Reloading the Model

In [142]:
# Create AdaBoost classifier object
clsfier = AdaBoostClassifier()

# Train
clsfier = clsfier.fit(X_train,y_train)

#Predict
y_pred = clsfier.predict(X_test)

print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

# Let's use pickle to save the model to a file
import pickle 
# save model using pickle
with open('samplemodel.pkl','wb') as f:
    pickle.dump(clsfier,f)

Accuracy: 
0.9191777777777778


In [143]:
# load the model you saved
with open('samplemodel.pkl', 'rb') as f:
    loadedModel = pickle.load(f)

In [144]:
#Predict using loaded model
y_pred = loadedModel.predict(X_test)

print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

Accuracy: 
0.9191777777777778


## Exercise
Try to improve the performance of the Decision Tree further by tuning other parameters. Refer to the online documentation.

Similarly, try to improve the performance of AdaBoostClassifier and RandomForestClassifier by fine-tuning the models.

You may also try other classification models available in the sklearn package.
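
As a possible starting point for the exercise, scikit-learn's `GridSearchCV` can try several parameter combinations with cross-validation. The sketch below runs on synthetic data so that it is self-contained. The grid values are arbitrary examples, and on the toy dataset you would fit on `X_train, y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=23)

# Every combination of these values is tried with 5-fold cross-validation
param_grid = {"max_depth": [1, 3, 5, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=23), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the best combination found
print(search.best_score_)   # its mean cross-validated accuracy
```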