# Python for Machine Learning

1. NumPy
2. SciPy
3. Matplotlib
4. Pandas
5. Scikit-learn

- **NumPy**: Used for numerical computation and for working with multidimensional arrays.
- **SciPy**: Offers a wide range of functions for scientific and technical computing.
- **Matplotlib**: Used for data visualization and plotting.
- **Pandas**: An effective tool for data analysis and manipulation. It provides table-like data structures.
- **Scikit-learn**: Supports many machine learning tasks such as classification, regression, and clustering.

Do not reuse the names of Python's modules or built-ins for your own names, because the new name shadows the original. For example, a module should not be named "set".
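A minimal sketch of why such names are best avoided, using the built-in `set` as the example. The function and variable names are illustrative only, and the shadowing is kept inside a function so that it does not leak into the rest of the session.

```python
def shadowing_demo():
    # Rebinding the name `set` hides the built-in inside this function.
    set = [1, 2, 2, 3]
    try:
        set([1, 2, 2, 3])  # this now tries to call a list, not the built-in type
    except TypeError as err:
        return str(err)
    return None

error_message = shadowing_demo()
print(error_message)

# Outside the function the built-in is untouched.
unique_values = set([1, 2, 2, 3])
print(unique_values)
```

The same kind of clash can happen with file names: a local file named after a library module can be imported in place of the real one.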

## Scikit-learn
- works together with NumPy and SciPy
- supports many common machine learning tasks such as classification, regression, clustering, dimensionality reduction, and model selection
- easy to use
- well documented

It covers most steps of a typical machine learning workflow:

- pre-processing of data
- feature selection
- feature extraction
- train/test splitting
- defining the algorithms
- fitting models
- tuning parameters
- prediction
- evaluation
- exporting the model

In [1]:
from sklearn import preprocessing

In [None]:
# pre-processing of data
X = preprocessing.StandardScaler().fit(X).transform(X) # fit the scaler on X, then standardize X
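To make the effect of standardization concrete, here is a small sketch on a made-up array (the values are assumptions for illustration, not taken from the dataset). After scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 4 observations, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

scaler = StandardScaler().fit(X_toy)   # learns the per-column mean and std
X_scaled = scaler.transform(X_toy)     # applies (x - mean) / std column by column

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In a real workflow the scaler is usually fitted on the training split only and then reused to transform the test split.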

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
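The split can be checked on a small made-up array. This sketch assumes 10 toy observations and adds `random_state`, which the cell above does not use, so that the shuffle is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 observations, 2 features
y_toy = np.array([0, 1] * 5)           # 10 labels

# random_state fixes the shuffle so the same split is produced on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)  # 7 training rows and 3 test rows
```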

In [None]:
from sklearn import svm
# Create an SVM classifier
clf = svm.SVC(gamma=0.01, C=90.0) # estimator instance

clf.fit(X_train, y_train) # fit the model

clf.predict(X_test) # predict the response

In [None]:
from sklearn.metrics import confusion_matrix # use metrics to evaluate the model
y_pred = clf.predict(X_test) # predictions to compare against the true labels
print(confusion_matrix(y_test, y_pred, labels=[1,0]))
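How to read the matrix is easier to see on a tiny hand-checkable case. The labels and predictions below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0]   # made-up ground-truth labels
y_hat = [1, 1, 0, 0, 1]    # made-up predictions

# Rows are true classes, columns are predicted classes,
# both ordered as given in `labels` (here: 1 first, then 0).
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm)
# [[2 1]
#  [1 1]]
```

With `labels=[1, 0]`, the top-left cell counts true 1s predicted as 1, and the bottom-right cell counts true 0s predicted as 0.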

In [None]:
import pickle
# save the model
s = pickle.dumps(clf)
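A saved model is only useful if it can be loaded again. The following sketch fits a small classifier on made-up data (an assumption, since the cells above do not define `X` and `y`), serializes it, and checks that the restored copy gives the same predictions.

```python
import pickle
from sklearn import svm

# Tiny, linearly separable toy data (assumed for illustration).
X_toy = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_toy = [0, 0, 1, 1]

model = svm.SVC(kernel="linear").fit(X_toy, y_toy)

blob = pickle.dumps(model)      # serialize the fitted estimator to bytes
restored = pickle.loads(blob)   # rebuild the estimator from those bytes

same = list(model.predict(X_toy)) == list(restored.predict(X_toy))
print(same)
```

Pickled data should only be loaded from a trusted source, and it is safest to load a model with the same scikit-learn version that saved it.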

# Supervised vs Unsupervised

- Supervised learning: the training data you feed to the algorithm includes the desired solutions, called labels
- Unsupervised learning: the training data is unlabeled, and the system tries to find structure in the data on its own
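The difference shows up directly in how the estimators are fitted. In this sketch on made-up data, the classifier receives labels while the clustering model does not. `KNeighborsClassifier` and `KMeans` are chosen only as convenient examples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy groups (values are made up).
X_toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
                  [5.0, 5.1], [5.1, 5.0], [5.2, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])   # labels, used only by the classifier

# Supervised: the estimator sees both X and y.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
supervised_pred = knn.predict([[0.1, 0.1], [5.1, 5.1]])

# Unsupervised: the estimator sees X only and groups similar rows.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_

print(supervised_pred)
print(cluster_ids)
```

The cluster ids are arbitrary: the two groups are separated, but which one is called 0 and which 1 carries no meaning.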

https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data


In [2]:
import pandas as pd
import numpy as np

cancer_data = pd.read_csv('cancer_data.csv')

In [4]:
cancer_data.head() # show the first 5 rows

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [6]:
cancer_data.sample(5) # random sample of 5 rows

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
189,874839,B,12.3,15.9,78.83,463.7,0.0808,0.07253,0.03844,0.01654,...,19.59,86.65,546.7,0.1096,0.165,0.1423,0.04815,0.2482,0.06306,
347,89869,B,14.76,14.74,94.87,668.7,0.08875,0.0778,0.04608,0.03528,...,17.93,114.2,880.8,0.122,0.2009,0.2151,0.1251,0.3109,0.08187,
53,857392,M,18.22,18.7,120.3,1033.0,0.1148,0.1485,0.1772,0.106,...,24.13,135.1,1321.0,0.128,0.2297,0.2623,0.1325,0.3021,0.07987,
57,857793,M,14.71,21.59,95.55,656.9,0.1137,0.1365,0.1293,0.08123,...,30.7,115.7,985.5,0.1368,0.429,0.3587,0.1834,0.3698,0.1094,
142,869218,B,11.43,17.31,73.66,398.0,0.1092,0.09486,0.02031,0.01861,...,26.76,82.66,503.0,0.1413,0.1792,0.07708,0.06402,0.2584,0.08096,


In [7]:
cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [8]:
cancer_data["diagnosis"] = cancer_data["diagnosis"].astype('category') # change the type of the column to category
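What the `category` type stores can be inspected on a miniature, hypothetical version of the column. The DataFrame below is a stand-in, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature version of the diagnosis column.
df = pd.DataFrame({"diagnosis": ["M", "B", "M", "B"]})
df["diagnosis"] = df["diagnosis"].astype("category")

categories = list(df["diagnosis"].cat.categories)  # sorted unique values
codes = df["diagnosis"].cat.codes.tolist()         # integer code for each row

print(categories)  # ['B', 'M']
print(codes)       # [1, 0, 1, 0]
```

Each row is stored as a small integer code that points into the list of categories, which saves memory on columns with few distinct values.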

In [9]:
cancer_data.dtypes # check the type of the columns

id                            int64
diagnosis                  category
radius_mean                 float64
texture_mean                float64
perimeter_mean              float64
area_mean                   float64
smoothness_mean             float64
compactness_mean            float64
concavity_mean              float64
concave points_mean         float64
symmetry_mean               float64
fractal_dimension_mean      float64
radius_se                   float64
texture_se                  float64
perimeter_se                float64
area_se                     float64
smoothness_se               float64
compactness_se              float64
concavity_se                float64
concave points_se           float64
symmetry_se                 float64
fractal_dimension_se        float64
radius_worst                float64
texture_worst               float64
perimeter_worst             float64
area_worst                  float64
smoothness_worst            float64
compactness_worst           

- Attributes, also called features: the input variables, stored as columns
- Observations: the rows

In [5]:
# Attributes
cancer_data.columns[2:]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

In [None]:
# Observations (rows): each row represents a patient
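Once attributes and observations are identified, the table can be split into a feature matrix and a target vector. This is a sketch on a three-row, hypothetical stand-in for `cancer_data`. The encoding M to 1 and B to 0 is an assumption, and dropping `Unnamed: 32` assumes the column is entirely empty, as it appears to be in the rows shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for cancer_data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 12.30, 20.57],
    "texture_mean": [10.38, 15.90, 17.77],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

df = df.dropna(axis=1, how="all")           # drop columns that are entirely empty
X = df.drop(columns=["id", "diagnosis"])    # attributes (features)
y = df["diagnosis"].map({"M": 1, "B": 0})   # assumed encoding: M -> 1, B -> 0

print(X.shape)      # (observations, features)
print(y.tolist())
```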

# Supervised Learning Techniques:
- Classification
- Regression

Classification models are used to predict discrete values (categories). Regression models are used to predict continuous values.

The amount of oil poured into a bottle is a continuous variable, because it can take a fractional value such as 1.5 L.
The number of children a person has, however, is a discrete variable, because it must be a whole number: nobody can have 3.5 children.
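The two kinds of target lead to two kinds of model output. The sketch below uses made-up data, with `LinearRegression` and `LogisticRegression` picked only as simple representatives of each family.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])

# Regression: the target is continuous (here exactly y = 2x).
y_continuous = np.array([0.0, 2.0, 4.0, 6.0])
reg = LinearRegression().fit(X_toy, y_continuous)
reg_pred = reg.predict([[1.5]])              # a real number, close to 3.0

# Classification: the target is one of a fixed set of classes.
y_discrete = np.array([0, 0, 1, 1])
clf_toy = LogisticRegression().fit(X_toy, y_discrete)
clf_pred = clf_toy.predict([[0.0], [3.0]])   # class labels only

print(reg_pred)
print(clf_pred)
```

The regressor can return any value on the number line, while the classifier can only return one of the classes it saw during training.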

Supervised vs Unsupervised

- Labeled vs. unlabeled data

The most widely used unsupervised machine learning techniques are:

- Dimensionality reduction
- Density estimation
- Market basket analysis
- Clustering
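As one illustration of these techniques, here is a minimal dimensionality reduction sketch with PCA. The data is randomly generated for the example, and its third feature is deliberately built as the sum of the first two so that two components are enough to describe it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 6 observations, 3 features. The third feature is the sum
# of the first two, so it carries no extra information.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X_toy = np.column_stack([base, base[:, 0] + base[:, 1]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_toy)   # no labels are passed

print(X_toy.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```

No labels are involved at any point, which is what makes this an unsupervised technique.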