# Python in Data Science Workshop

---

# "Machine Learning is done in Python, A.I. in PowerPoint."

---

## Machine Learning

### 1. Introduction. Classification
### 2. Cross-validation. Regression
### 3. Hyperparameter optimization. Grid Search
### 4. Unsupervised learning. Clustering
### 5. Reinforcement Learning
---
## Machine Learning - part 1 of 5. Introduction. Classification

### Introduction
#### Basic concepts
#### Bias-variance trade-off
#### The machine learning process
#### Classification example
#### Evaluating results
#### Types of classifiers
---


![AI](img/AI1.png)


https://en.wikipedia.org/wiki/Machine_learning

---

## Machine Learning

- Supervised
- Unsupervised

## Unsupervised Machine Learning
- Clustering
- Association rules

## Supervised Machine Learning
- Classification
- Regression

---
### For the explanatory variables `X`, we look for a function `f` that best reproduces the explained data `y`

$$
y \approx f(X)
$$

---


In [None]:
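%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use("dark_background")   # set the style before creating the figure so it applies to it
plt.figure(figsize=(10,6))
x = np.linspace(-1, 1, 10)
plt.scatter(x, x-0.5*np.abs(x));   # 10 sample points of y = x - 0.5|x|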

In [None]:
plt.figure(figsize=(10,6))
plt.style.use("dark_background")
x = np.linspace(-1, 1, 100)
plt.plot(x, x-0.6*x*x)

x = np.linspace(-1, 1, 10)
plt.scatter(x, x-0.5*np.abs(x));

In [None]:
plt.figure(figsize=(10,6))
x = np.linspace(-1, 1, 100)
plt.plot(x, x-0.6*x*x)
plt.plot(x, 1.0*x)
x = np.linspace(-1, 1, 10)
plt.scatter(x, x-0.5*np.abs(x));

In [None]:
plt.figure(figsize=(10,6))
x = np.linspace(-10, 10, 100)
plt.plot(x, x-0.6*x*x)
plt.plot(x, 1*x)
x = np.linspace(-10, 10, 25)
plt.scatter(x, x-0.5*np.abs(x));
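
The curves in the plots above are hand-picked. As a minimal sketch, assuming `np.polyfit` as the fitting procedure (it is not used in the original notebook), the same effect can be checked numerically: a degree-2 polynomial fits the points on [-1, 1] better than a straight line, but does worse once the range is widened to [-10, 10]. The names `target`, `x_fit`, `x_wide` and `errors` are illustrative only.

```python
import numpy as np

def target(x):
    # The function that generates the scattered points in the plots above
    return x - 0.5 * np.abs(x)

# Fit on the narrow range [-1, 1], as in the first plots
x_fit = np.linspace(-1, 1, 10)
y_fit = target(x_fit)

# Evaluate on the wide range [-10, 10], as in the last plot
x_wide = np.linspace(-10, 10, 25)
y_wide = target(x_wide)

errors = {}
for degree in (1, 2):
    coeffs = np.polyfit(x_fit, y_fit, degree)  # least-squares polynomial fit
    narrow_mse = np.mean((np.polyval(coeffs, x_fit) - y_fit) ** 2)
    wide_mse = np.mean((np.polyval(coeffs, x_wide) - y_wide) ** 2)
    errors[degree] = (narrow_mse, wide_mse)
    print(f"degree {degree}: MSE on [-1, 1] = {narrow_mse:.4f}, "
          f"MSE on [-10, 10] = {wide_mse:.4f}")
```

A lower error on the data used for fitting says nothing about behaviour outside that range.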

## Bias-variance trade-off

$$
Bias~[~\hat{f}(x)~] = E~[~\hat{f}(x)~] - f(x)
$$

$$
Var~[~\hat{f}(x)~] = E~[~\hat{f}(x)^2~] - E~[~\hat{f}(x)~]^2
$$
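
Both quantities can be estimated by simulation. The sketch below is an illustration under stated assumptions, not part of the original notebook: the "true" function is the one from the earlier plots, the noise level `noise_sd`, the evaluation point `x0` and the polynomial degrees are arbitrary choices, and `np.polyfit` stands in for the learning algorithm. It refits the model on many noisy training sets and applies the two formulas above to the predictions at `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def f(x):
    # Assumed "true" function, borrowed from the earlier plots
    return x - 0.5 * np.abs(x)

x_obs = np.linspace(-1, 1, 10)  # inputs of every simulated training set
x0 = 0.5                        # point at which bias and variance are estimated
n_repeats = 2000                # number of simulated training sets
noise_sd = 0.2                  # assumed standard deviation of the noise

summary = {}
for degree in (1, 7):
    predictions = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit the polynomial
        y_obs = f(x_obs) + rng.normal(0.0, noise_sd, size=x_obs.shape)
        coeffs = np.polyfit(x_obs, y_obs, degree)
        predictions[i] = np.polyval(coeffs, x0)
    # Bias = E[f_hat(x0)] - f(x0)
    bias = predictions.mean() - f(x0)
    # Var = E[f_hat(x0)^2] - E[f_hat(x0)]^2
    variance = (predictions ** 2).mean() - predictions.mean() ** 2
    summary[degree] = (bias, variance)
    print(f"degree {degree}: bias = {bias:+.4f}, variance = {variance:.4f}")
```

The more flexible degree-7 model reacts more strongly to the noise in each training set, so its variance at `x0` is higher than that of the straight line.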


---
## The learning process in Machine Learning

1. Data preparation
2. Data split
3. Model building
4. Accuracy test
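
The four steps can be sketched end to end on a tiny, hypothetical corpus. The messages and labels below are invented for illustration, and `make_pipeline` is one possible way to chain the vectorizer and the classifier; the notebook itself performs the same steps one cell at a time on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Data preparation: hypothetical messages (1 = spam, 0 = ham)
demo_messages = [
    "WINNER! Claim your free prize now",
    "Free entry: text WIN to claim cash",
    "Urgent! Your account won a cash prize",
    "Call now for a free ringtone",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
    "Can you send me the report tomorrow?",
    "See you at the gym tonight",
]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 2. Data split: hold out 25% of the messages for testing
msg_train, msg_test, lab_train, lab_test = train_test_split(
    demo_messages, demo_labels, test_size=0.25, random_state=0
)

# 3. Model building: TF-IDF features followed by a Naive Bayes classifier
demo_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
demo_model.fit(msg_train, lab_train)

# 4. Accuracy test on the held-out messages
demo_accuracy = demo_model.score(msg_test, lab_test)
print(f"accuracy on {len(msg_test)} held-out messages: {demo_accuracy:.2f}")
```

With only two held-out messages the accuracy figure is meaningless; the point is the order of the steps.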

https://www.kaggle.com/uciml/sms-spam-collection-dataset

In [None]:
import pandas as pd
df = pd.read_csv('data/spam.csv', encoding='ISO-8859-1')

In [None]:
df.head()

In [None]:
df.rename(columns = {'v1':'class_label', 'v2':'message'}, inplace = True)
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace = True)
df

In [None]:
df['class_label'].value_counts()

In [None]:
import matplotlib.pyplot as ab
import numpy as np
labels = ['ham', 'spam']
counts = [4825, 747]
ypos = np.arange(len(labels))  # bar positions on the x-axis: 0 for 'ham', 1 for 'spam'
ypos

In [None]:
ab.xticks(ypos, labels)
ab.xlabel("class label")
ab.ylabel("Frequency")
ab.title("# of spam and ham in dataset")
ab.bar(ypos, counts);

In [None]:
df['class_label'] = df['class_label'].apply(lambda x: 1 if x == 'spam' else 0)

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['message'], df['class_label'], test_size = 0.3, random_state = 0)

In [None]:
print('rows in test set: ' + str(x_test.shape))
print('rows in train set: ' + str(x_train.shape))
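
The split above is purely random, so the share of spam in the two parts may drift slightly apart. A possible refinement, shown here only as a sketch on synthetic labels with the class counts from the bar chart (4825 ham, 747 spam), is the `stratify` argument of `train_test_split`, which keeps the class proportions nearly identical in both parts. The variable names are illustrative and chosen so that they do not overwrite the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts used in the bar chart above
y_all = np.array([0] * 4825 + [1] * 747)        # 0 = ham, 1 = spam
X_all = np.arange(len(y_all)).reshape(-1, 1)    # placeholder features; only the labels matter here

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

print(f"spam share overall:   {y_all.mean():.4f}")
print(f"spam share in train:  {y_tr.mean():.4f}")
print(f"spam share in test:   {y_te.mean():.4f}")
```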

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

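data = x_train.tolist()

# `input` keeps its default ('content'): the documents themselves are passed to fit_transform
vectorizer = TfidfVectorizer(
    lowercase=True,        # normalize case before tokenizing
    stop_words='english'   # drop common English words
)
features_train_transformed = vectorizer.fit_transform(data)   # learn the vocabulary and IDF weights on the training set only
features_test_transformed  = vectorizer.transform(x_test)     # reuse the fitted vocabulary for the test set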

In [None]:
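# Note: this overwrites the original `df` with the TF-IDF feature matrix.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df = pd.DataFrame(features_train_transformed.toarray(), columns = vectorizer.get_feature_names_out())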
df.head()

---
# Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
# train the model
classifier = MultinomialNB()
classifier.fit(features_train_transformed, y_train)

In [None]:
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, y_test) * 100))

In [None]:
from sklearn.svm import SVC
# train the model
classifier = SVC()
classifier.fit(features_train_transformed, y_train)

In [None]:
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, y_test) * 100))

![PrecisionRecall](img/Precisionrecall.svg.png)


https://en.wikipedia.org/wiki/Precision_and_recall

Example:

- 990 "ham"
- 10 "spam"


Accuracy: the percentage of all answers that are correct

"Everything is ham": 99%

Precision: the percentage of answers given for a class that are correct (lowered by false positives, i.e. type I errors)

"Everything is ham": 99% for ham, 0% for spam (by convention, since nothing is labeled spam)

Recall: the percentage of the actual members of a class that were found (lowered by false negatives, i.e. type II errors)

"Everything is ham": 100% for ham, 0% for spam

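The numbers in this example can be reproduced by hand. The sketch below uses plain Python rather than scikit-learn so that the formulas stay visible; the helper `precision_recall` is hypothetical, and reporting 0 when a denominator is zero is an assumed convention.

```python
# Ground truth from the example: 990 "ham" and 10 "spam" messages
y_true = ["ham"] * 990 + ["spam"] * 10
# Baseline "classifier" that answers "ham" for every message
y_pred = ["ham"] * len(y_true)

def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class, counted directly from the pairs."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    # Assumed convention: report 0.0 when the denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ham_precision, ham_recall = precision_recall(y_true, y_pred, "ham")
spam_precision, spam_recall = precision_recall(y_true, y_pred, "spam")

print(f"accuracy: {baseline_accuracy:.0%}")
print(f"ham:  precision {ham_precision:.0%}, recall {ham_recall:.0%}")
print(f"spam: precision {spam_precision:.0%}, recall {spam_recall:.0%}")
```

A classifier that never detects a single spam message still reaches high accuracy, which is why precision and recall are reported per class.
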
In [None]:
labels = classifier.predict(features_test_transformed)
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
actual = y_test.tolist()
predicted = labels
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print ('Accuracy Score :',accuracy_score(actual, predicted))
print ('Report : ')
print (classification_report(actual, predicted) )
score_2 = f1_score(actual, predicted, average = 'binary')
print('F-Measure: %.3f' % score_2)

In [None]:
import seaborn as sns
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                results.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     results.flatten()/np.sum(results)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(results, annot=labels, fmt='', cmap='Reds')

---
# Support Vector Machines

In [None]:
from sklearn.svm import SVC
# train the model
classifier = SVC()
classifier.fit(features_train_transformed, y_train)

In [None]:
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, y_test) * 100))

In [None]:
labels = classifier.predict(features_test_transformed)
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
actual = y_test.tolist()
predicted = labels
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print ('Accuracy Score :',accuracy_score(actual, predicted))
print ('Report : ')
print (classification_report(actual, predicted) )
score_2 = f1_score(actual, predicted, average = 'binary')
print('F-Measure: %.3f' % score_2)

In [None]:
import seaborn as sns
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                results.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     results.flatten()/np.sum(results)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(results, annot=labels, fmt='', cmap='Reds')

---
# Trade-off between result quality and model interpretability

# Decision trees

In [None]:
from sklearn import tree
treeclf = tree.DecisionTreeClassifier(random_state=0)
treeclf.fit(features_train_transformed, y_train)

In [None]:
# Score the decision tree itself (`classifier` still refers to the previous model)
print("classifier accuracy {:.2f}%".format(treeclf.score(features_test_transformed, y_test) * 100))

In [None]:
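import seaborn as sns
from sklearn.metrics import confusion_matrix
# Recompute the confusion matrix for the decision tree; otherwise `results` still holds the previous model's matrix
results = confusion_matrix(y_test, treeclf.predict(features_test_transformed))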
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                results.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     results.flatten()/np.sum(results)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(results, annot=labels, fmt='', cmap='Reds')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use("classic")
plt.figure(figsize=(18,12))
# Draw only the top 5 levels; feature names come from the TF-IDF DataFrame built earlier
tree.plot_tree(treeclf, max_depth=5, feature_names=list(df.columns));

# Simplifications made
- ## Uncleaned data
- ## Imbalanced classes
- ## No cross-validation

(but that is for next week...)

https://towardsdatascience.com/how-to-build-your-first-spam-classifier-in-10-steps-fdbf5b1b3870