<a href="https://colab.research.google.com/github/GioGio2004/comunication/blob/main/machine_learning_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first code cell imports three libraries commonly used in data science and machine learning:

NumPy (np):

NumPy stands for Numerical Python. It's a fundamental library for scientific computing in Python. It provides efficient data structures like arrays and matrices, along with mathematical functions for numerical operations.
Some common uses of NumPy include:
Creating and manipulating multidimensional arrays.
Performing linear algebra operations (matrix multiplication, vector dot products, etc.).
Implementing mathematical functions and calculations.

Pandas (pd):

Pandas is a powerful library for data analysis and manipulation in Python. It builds on top of NumPy and provides high-level data structures like Series (one-dimensional) and DataFrames (two-dimensional labeled data) for handling tabular data.
Some common uses of Pandas include:
Reading and writing data from various file formats (CSV, Excel, etc.).
Data cleaning and pre-processing (handling missing values, outliers, etc.).
Exploratory data analysis (calculating statistics, plotting visualizations).
Data wrangling and transformation.

Matplotlib.pyplot (plt):

Matplotlib is a comprehensive library for creating visualizations in Python. The pyplot submodule (plt) provides a convenient interface for creating various plots like line charts, scatter plots, histograms, and more.
Some common uses of Matplotlib.pyplot include:
Creating line plots to visualize trends in data.
Generating scatter plots to explore relationships between variables.
Creating histograms to understand data distribution.
Customizing plots with colors, labels, legends, etc.

By importing these libraries, you equip yourself with essential tools for various data science tasks:

NumPy for numerical computations and array manipulation.
Pandas for data loading, cleaning, analysis, and transformation.
Matplotlib for creating informative visualizations to understand your data better.
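
As a quick illustration of the first two libraries, here is a minimal sketch on made-up values. The array contents and the "height" and "weight" columns are hypothetical and are not part of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# NumPy: build a 2-D array and compute a matrix-vector product
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])
product = matrix @ vector  # equivalent to np.dot(matrix, vector)

# pandas: wrap tabular data in a DataFrame with labeled columns
toy = pd.DataFrame({"height": [1.5, 1.8, 1.6], "weight": [50.0, 80.0, 60.0]})
mean_height = toy["height"].mean()

print(product)      # [ 50 110]
print(mean_height)
```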

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


Defines Column Names

A list named cols holds the column names for the CSV data: ten feature names ("fLength", "fWIdth", "fSize", etc.) followed by "class", the label column.

Loads Data into DataFrame:

The pd.read_csv function from pandas reads the CSV file named 'magic04.data'.
The names argument is set to the cols list, so the DataFrame columns get those names.
Because the names are passed explicitly, pandas treats the first line of the file as data rather than as a header. This matters when the file has no header row: without names, the first data row would be consumed as column names.

Displays DataFrame Head:

The .head() method returns the first five rows of the DataFrame (df) by default. Checking them is a quick way to verify that the data loaded correctly and has the expected columns.
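
To see the effect of the names argument without needing the data file, here is a minimal sketch that reads two header-less rows from an in-memory string. The two rows are copied from the head() output shown in this notebook; the io.StringIO buffer is a stand-in for the real file and is an assumption of this example.

```python
import io
import pandas as pd

cols = ["fLength", "fWIdth", "fSize", "fConc", "fConc1", "fAsym",
        "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# Two header-less rows in the same layout as the notebook's data
raw = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g\n"
)

with_names = pd.read_csv(raw, names=cols)  # both lines are kept as data
raw.seek(0)                                # rewind the buffer to read it again
without_names = pd.read_csv(raw)           # first line is consumed as the header

print(with_names.shape)     # (2, 11)
print(without_names.shape)  # (1, 11)
```

Without names, one row of data is silently lost to the header, which is why the column list is passed explicitly.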

In [2]:
cols = ["fLength", "fWIdth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data', names=cols)
df.head()


Unnamed: 0,fLength,fWIdth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g


In [3]:
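# Encode the label as an integer in place: 1 for gamma ("g"), 0 for everything else (hadron).
# The histogram cell compares df["class"] against 1 and 0, so the "class" column itself must be numeric.
df["class"] = (df["class"] == "g").astype(int)
# df.head()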


Looping Through Features: The code iterates over cols[:-1], which is every column except the last one (the class label). For each feature it draws two overlaid histograms, one per class (gamma and hadron).

Selecting Data for Histograms:

df[df["class"] == 1][label] selects the rows where the "class" column is 1 (representing "gamma") and extracts the values of the current feature (label).
df[df["class"] == 0][label] does the same for rows with a class of 0 (representing "hadron").

Creating Histograms:

plt.hist() draws the histogram for each class:
color='blue' for the "gamma" class.
color='red' for the "hadron" class.
alpha=0.7 makes the bars semi-transparent so that both overlapping histograms stay visible.
density=True rescales each histogram so that its total area is 1. When the classes have different numbers of rows, this makes their shapes directly comparable.

Plot Formatting:

plt.title(label) sets the plot title to the current feature name.
plt.ylabel("Probability") labels the y-axis as "Probability". With density=True the bar heights are probability densities, so individual values can exceed 1.
plt.xlabel(label) labels the x-axis with the feature name.
plt.legend() displays a legend to distinguish the two classes.
plt.show() renders the plot.

Purpose:

The code generates one plot per feature, showing that feature's distribution for both the "gamma" and "hadron" classes. This allows you to:

Compare Distributions: Observe how features differ for the two classes, potentially helping in identifying patterns and features that might be useful for classification.
Understand Data: Get insights into the shape, spread, and central tendencies of features within each class.
Visualize Relationships: Explore potential relationships between features and the class label.
By visually comparing the histograms, you can gain a better understanding of how features relate to the classes and make informed decisions for further analysis or modeling.
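
The effect of density=True can be checked numerically. The sketch below uses np.histogram, which accepts the same density argument, on two synthetic samples of different sizes. The samples, their sizes, and the bin count are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
big_class = rng.normal(loc=0.0, scale=1.0, size=1000)   # hypothetical class with many rows
small_class = rng.normal(loc=1.0, scale=1.0, size=200)  # hypothetical class with few rows

# density=False (the default): bar heights are raw counts and sum to the class size
counts_big, _ = np.histogram(big_class, bins=20)
counts_small, _ = np.histogram(small_class, bins=20)

# density=True: heights are rescaled so that sum(height * bin_width) equals 1
dens_big, edges_big = np.histogram(big_class, bins=20, density=True)
dens_small, edges_small = np.histogram(small_class, bins=20, density=True)

area_big = np.sum(dens_big * np.diff(edges_big))
area_small = np.sum(dens_small * np.diff(edges_small))

print(counts_big.sum(), counts_small.sum())  # 1000 200
print(area_big, area_small)                  # both approximately 1.0
```

With raw counts the larger class would dominate the plot; after normalization both histograms cover the same area, so only their shapes differ.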

In [10]:
for label in cols[:-1]:
    plt.hist(df[df["class"] == 1][label], color='blue', label='gamma', alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color='red', label='hadron', alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()

Train, validation, and test datasets

The final cell splits the pandas DataFrame (df) into training, validation, and test sets in a single line using NumPy's np.split function. Here's a breakdown of how it works:

Shuffling the Data: df.sample(frac=1) returns all rows of the DataFrame in random order. frac=1 means the sample contains 100% of the rows, so nothing is dropped and the rows are only reordered. Shuffling first ensures that each split receives a random selection of data points.

np.split Function:

This function splits an array (in this case, the shuffled DataFrame) into sub-arrays along an axis (axis 0, the rows, by default). When the second argument is a list of indices, the array is cut at those positions.
The provided arguments are:
df.sample(frac=1): The shuffled DataFrame to be split.
[int(0.6*len(df)), int(0.8*len(df))]: This is a list defining the split points.
The first element int(0.6*len(df)) is the row position of the cut between the training and validation sets: 60% of the DataFrame's length, truncated to an integer.
The second element int(0.8*len(df)) is the row position of the cut between the validation and test sets: 80% of the DataFrame's length, truncated to an integer.

Resulting Split:

np.split returns a list of three DataFrames, which the assignment unpacks into:
train: the first 60% of the shuffled data (training set).
valid: the next 20% of the shuffled data (validation set).
test: the remaining 20% of the shuffled data (test set).
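
The cut-point behaviour is easiest to see on a small array. This is a minimal sketch in which np.arange(10) stands in for ten shuffled rows; the array is an assumption for illustration only.

```python
import numpy as np

data = np.arange(10)                 # stand-in for 10 shuffled rows
n = len(data)
cuts = [int(0.6 * n), int(0.8 * n)]  # [6, 8]

# Cut before position 6 and before position 8, giving three pieces
train, valid, test = np.split(data, cuts)

print(train)  # [0 1 2 3 4 5]
print(valid)  # [6 7]
print(test)   # [8 9]
```

The list passed to np.split holds positions, not sizes: the second entry is 8 (60% plus 20%), not 2.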
The same split, written out step by step with comments:

# Shuffle the DataFrame (assuming 'df' is your data)
df_shuffled = df.sample(frac=1)

# Calculate the sizes of the training, validation, and test sets
train_size = int(0.6 * len(df_shuffled))
valid_size = int(0.8 * len(df_shuffled)) - train_size
test_size = len(df_shuffled) - train_size - valid_size

# Split the DataFrame at the two cut points using np.split
train, valid, test = np.split(df_shuffled, [train_size, train_size + valid_size])

# Now you have training, validation, and test sets ready for further processing

Remember to adjust the split ratios (0.6, 0.8 in this case) based on your specific requirements and the size of your dataset.


In [7]:
train, valid, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])
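
Some recent pandas releases emit a deprecation warning when np.split is applied to a DataFrame. If that happens, the same 60/20/20 split can be written with positional slicing. The sketch below is an alternative under stated assumptions, not the notebook's original method: the toy DataFrame, the variable names, and random_state=0 (added so the shuffle is repeatable) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: 10 rows, one feature, a 0/1 label
toy = pd.DataFrame({"x": np.arange(10), "class": [0, 1] * 5})

# Shuffle all rows; random_state makes the order repeatable between runs
shuffled = toy.sample(frac=1, random_state=0)
first_cut = int(0.6 * len(shuffled))
second_cut = int(0.8 * len(shuffled))

# Slice by position: first 60%, next 20%, remaining 20%
train_part = shuffled.iloc[:first_cut]
valid_part = shuffled.iloc[first_cut:second_cut]
test_part = shuffled.iloc[second_cut:]

print(len(train_part), len(valid_part), len(test_part))  # 6 2 2
```

Every row lands in exactly one of the three parts, as with np.split.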