# 1. Introduction to Jupyter
This is an example of Markdown text. The number of "#" characters determines the heading level.

demo: $\sum_i$, <b>text in bold</b>

# first level.
## second level.
### third level.

## 1.1. Running your first program
Here we test how Python works.

In [None]:
# This is a comment in Python. You can run this cell by pressing "Shift + Enter" 
1+2+5

In [None]:
print("hello world")

Let's review some data structures in Python: List, String, and Dictionary.

In [None]:
num_list = [1,2,3,4,5]
str_list = ['abc','9news','yahoo9']
students = {'s12345':'John', 's23456':'Mary','s54321':'Jeff'}
# print the entire data
print(num_list)
print(str_list)
print(students)
# print the items in each different data structure.
print(num_list[0],num_list[-1])
print(str_list[0:2])
print(students['s23456'])
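
As a small, self-contained sketch of a few more operations on these structures, the following shows list slicing, string slicing, and a dictionary lookup that does not fail on a missing key. All names and values here are made up for illustration.

```python
# Illustrative values only; not taken from the course data.
num_list = [1, 2, 3, 4, 5]
course = "data mining"
students = {'s12345': 'John', 's23456': 'Mary'}

# List slicing: the start index is inclusive, the stop index is exclusive.
first_three = num_list[:3]
last_two = num_list[-2:]

# Strings support the same indexing and slicing as lists.
first_word = course[:4]

# dict.get returns a default instead of raising KeyError for a missing key.
unknown = students.get('s99999', 'not enrolled')

# Iterate over the key/value pairs of a dictionary.
for student_id, name in students.items():
    print(student_id, name)

print(first_three, last_two, first_word, unknown)
```

Using `students['s99999']` instead of `students.get(...)` would raise a `KeyError`, which is why `get` is handy when a key may be absent.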


We are going to run some code samples to introduce the main topics covered in this course. If you are not familiar with Python libraries such as pandas, scikit-learn, and matplotlib, please read some quick tutorials first. <br>
10 minutes to Pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html <br>
scikit manual:<br>
matplotlib:<br>
Bank Data: https://archive.ics.uci.edu/ml/datasets/bank+marketing

<b>Main topics covered in this course</b>:<br>
<li>Explore Data and Pre-process Data</li>
<li>Data Warehouse and OLAP</li>
<li>Mining Frequent Patterns</li>
<li>Machine Learning in Data Mining</li>
<li>Outlier Detection</li>
<li>Time Series and Sequential Data Mining</li>
<li>Text Database Mining</li>
<li>World-Wide-Web Mining</li>
<li>Data Mining on Information Networks</li>


# 2. Data Mining with Python and its amazing libraries.
## 2.1. Reading and Displaying Data with pandas

In [None]:
import pandas as pd
import numpy as np
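import matplotlib as plt  # note: plt aliases the top-level matplotlib package here
import matplotlib.pyplot  # makes sure plt.pyplot is available in the plotting cells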

#df = pd.read_csv("./bank.csv")
df = pd.read_csv("./bank.csv",delimiter=";")
df.head(10)

## 2.2. You can summarise the numerical features by using the describe() function in pandas

In [None]:
df.describe()
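
To see what describe() reports without needing the bank file, here is a minimal sketch on a toy DataFrame. The column names mimic the bank data, but the values are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the bank dataset).
toy = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'balance': [100, -50, 2300, 800],
    'job': ['admin.', 'technician', 'admin.', 'services'],
})

# On a mixed DataFrame, describe() summarises only the numeric columns:
# count, mean, std, min, quartiles, and max.
numeric_summary = toy.describe()
print(numeric_summary)

# On a text column, describe() reports count, number of unique values,
# the most frequent value (top), and how often it occurs (freq).
text_summary = toy['job'].describe()
print(text_summary)
```

Note that the text column `job` is silently left out of the first summary, so a missing column in the describe() output usually means the column is not numeric.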

## 2.3. We can use the value_counts() function to look at the distribution of a specific column

In [None]:
df['age'].value_counts()

In [None]:
df['job'].value_counts()
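
Conceptually, value_counts() tallies how often each distinct value occurs and sorts the result by frequency. A minimal pure-Python sketch of the same idea, using made-up job labels rather than the bank data:

```python
from collections import Counter

# Made-up labels for illustration.
jobs = ['admin.', 'technician', 'admin.', 'services', 'admin.', 'technician']

counts = Counter(jobs)
# most_common() sorts by count, highest first, like value_counts().
ranked = counts.most_common()
print(ranked)

# Relative frequencies, comparable to value_counts(normalize=True).
total = sum(counts.values())
shares = {job: n / total for job, n in ranked}
print(shares)
```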

To get a visual impression of the data itself, we can use some visualization tools.

In [None]:
# Draw a histogram of the age column with 100 bins
df['age'].hist(bins=100)
plt.pyplot.show()

In [None]:
# Draw a box plot of the age column
df.boxplot(column='age')

Besides the histogram, we can use a box plot, which summarises the distribution of the data through its median, quartiles, and outliers.

In [None]:
plt.pyplot.show()

The following figures show the median, quartiles, and range (minimum and maximum, excluding outliers) of age in different groups.

In [None]:
# Box plots of age grouped by marital status and by education level.
# Set showfliers=True to draw the outliers, or False to hide them.
df.boxplot(column='age', by='marital', showfliers=False)
df.boxplot(column='age', by='education', showfliers=False)
plt.pyplot.show()
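
To make the elements of a box plot concrete, the sketch below computes them by hand for a handful of made-up ages. It assumes the common convention (matplotlib's default, as far as we know) that the whiskers reach at most 1.5 times the interquartile range beyond the box, and that anything further out is drawn as an outlier ("flier").

```python
import statistics

# Made-up ages for illustration; 95 is an obvious outlier.
ages = [22, 25, 29, 31, 35, 38, 41, 44, 95]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(ages, n=4, method='inclusive')
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inliers = [a for a in ages if lower_fence <= a <= upper_fence]
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# The whiskers end at the most extreme points still inside the fences.
whisker_low, whisker_high = min(inliers), max(inliers)
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

With `showfliers=False`, the points collected in `outliers` are the ones that would be hidden from the figure.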

Check how many values are missing in each column.

In [None]:
# Count the missing values in each column
df.isnull().sum()

In [None]:
df_med = pd.read_csv("./admissions.csv")
df_med.head()

In [None]:
# Count the missing values in each column
df_med.isnull().sum()

We can see there are quite a few missing values in columns such as deathtime and language. In our course, we will discuss how we can deal with missing data.
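
As a preview, two common ways of handling missing values are dropping incomplete rows and filling the gaps with a substitute value. The sketch below shows both on a toy DataFrame. The values, the choice of the median, and the 'UNKNOWN' placeholder are assumptions for illustration, not a recommendation for the admissions data.

```python
import numpy as np
import pandas as pd

# Toy records, made up for illustration (not the admissions data).
toy = pd.DataFrame({
    'age': [34.0, np.nan, 51.0, 45.0],
    'language': ['ENGL', None, None, 'SPAN'],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
complete_rows = toy.dropna()

# Option 2: fill the gaps, e.g. the median for numbers and a placeholder for text.
filled = toy.copy()
filled['age'] = filled['age'].fillna(filled['age'].median())
filled['language'] = filled['language'].fillna('UNKNOWN')
print(filled)
```

Dropping rows is simple but discards data (here half of the rows), whereas filling keeps every row at the cost of introducing estimated values.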

# 3. Predictive Models in Data Mining.
## Let's build predictive models to classify the bank data.
The goal is to predict whether the client subscribed to a term deposit (Y or N), which is a typical binary classification problem.

In [None]:
df.head(5)

In [None]:
import numpy as np
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import minmax_scale
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer 


print(sklearn.__version__)

for col_name in df.columns:
    if(df[col_name].dtype=='object'):
        df[col_name] = df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes

X1 = np.array(df.values[:,0:9])
X2 = np.array(df.values[:,11:-1])
X = np.concatenate((X1,X2), axis=1)

y = np.array(df.values[:,-1])
print(len(df['job'].unique()),len(df['marital'].unique()),len(df['education'].unique()),len(df['default'].unique()),len(df['housing'].unique()),len(df['loan'].unique()),len(df['contact'].unique()),len(df['poutcome'].unique()))


#enc = OneHotEncoder(n_values = [12, 3, 4, 2, 2, 2, 3, 4], categorical_features=[1, 2, 3, 4, 6, 7, 8, 13])
#X = enc.fit_transform(X).toarray()

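# sparse_threshold=0 makes the transformer always return a dense array, which minmax_scale expects
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(),[1, 2, 3, 4, 6, 7, 8, 13])], remainder="passthrough", sparse_threshold=0)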
X = ct.fit_transform(X)

print(X.shape)
print(X[0,:])
X_n = minmax_scale(X)
print(X_n)


liblinear_params = {'C':[0.1,1,10,100 ] }     # Linear SVC
dtree_params = {'max_depth':[5,10,15] }    # Decision Tree classifier
knn_params = {'n_neighbors': [1, 5, 10, 15]}     # KNN (K Nearest Neighbors classifier)


X_tr, X_te, ys_tr, ys_te = train_test_split( X_n, y, test_size = 0.3)
#max_iter=10000000
liblinear = LinearSVC(dual=False)
clf = GridSearchCV(liblinear, liblinear_params, cv =2, n_jobs =1, verbose =1)
clf.fit(X_tr, ys_tr)
liblinear_pred = clf.predict(X_te)
accuracy = accuracy_score(ys_te, liblinear_pred)
print("Liblinear-> The best parameter is %.3f Accuracy: %.6f"%(clf.best_params_['C'],accuracy))
    
dtree = DecisionTreeClassifier()
clf = GridSearchCV(dtree, dtree_params, cv =2, n_jobs =1, verbose =1)
clf.fit(X_tr, ys_tr)
dtree_pred = clf.predict(X_te)
accuracy = accuracy_score(ys_te, dtree_pred)
print("Decision Tree-> The best parameter is %d Accuracy: %.6f"%(clf.best_params_['max_depth'],accuracy)) 
        
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, knn_params, cv =2, n_jobs =1, verbose =3)
clf.fit(X_tr, ys_tr)
knn_pred = clf.predict(X_te)
accuracy = accuracy_score(ys_te, knn_pred)
print("KNN -> The best parameter is %d Accuracy: %.6f"%(clf.best_params_['n_neighbors'],accuracy)) 


From the results above, we can compare the test accuracies of Linear SVM, Decision Tree, and KNN, each with the best parameter found by the grid search.
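
The cell above scales the whole dataset before splitting it. One alternative design, sketched here under stated assumptions, is to put the scaler inside a scikit-learn Pipeline so that each cross-validation fold fits the scaler on its own training portion only. The data is synthetic (generated with make_classification), and the sample size, number of folds, and random_state values are arbitrary choices for illustration, not settings from the course.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling is a pipeline step, so each CV fold fits the scaler
# on its own training portion only.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('knn', KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as '<step>__<parameter>'.
param_grid = {'knn__n_neighbors': [1, 5, 10, 15]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_['knn__n_neighbors']
accuracy = accuracy_score(y_te, search.predict(X_te))
print("KNN -> best n_neighbors: %d, test accuracy: %.3f" % (best_k, accuracy))
```

Fixing `random_state` in `train_test_split` also makes the reported accuracy reproducible from run to run.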

# 4. An Example of Association Rule Mining 

In [None]:
import mlxtend
print(mlxtend.__version__)

In [None]:
#conda config --add channels conda-forge            
#conda install mlxtend
#the above two lines are for installing the module: mlxtend

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#http://pbpython.com/market-basket-analysis.html
df = pd.read_csv("./sales.csv")
df.head(10)

In [None]:
# Pre-processing data: strip white space from descriptions, drop rows without an invoice number, and remove credit transactions (invoice numbers containing 'C')
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

In [None]:
print(df['Country'].value_counts())

As the dataset stores one record per purchased item, we need to aggregate how many units of each item were purchased in each transaction. Each transaction then becomes a bag-of-words-style vector whose entries hold the quantity of each item. In this example code, we only use records from France.

In [None]:
basket = (df[df['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

def encode_units(x):
    # 1 if the item was bought in this invoice, otherwise 0
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
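
To see the shape of the table this produces, here is the same groupby/unstack idea on a handful of made-up transaction lines. The item names and quantities are invented. The last step uses a boolean comparison instead of encode_units, on the assumption (based on our reading of the mlxtend documentation) that True/False flags are accepted as well as 0/1.

```python
import pandas as pd

# Toy transaction lines, made up for illustration.
toy = pd.DataFrame({
    'InvoiceNo': ['1001', '1001', '1002', '1002', '1003'],
    'Description': ['TEA', 'MUG', 'TEA', 'TEA', 'MUG'],
    'Quantity': [2, 1, 3, 1, 6],
})

# One row per invoice, one column per item; cells hold the summed quantity.
toy_basket = (toy.groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().fillna(0))
print(toy_basket)

# Keep only "was the item bought at all?" as a True/False flag.
toy_sets = toy_basket > 0
print(toy_sets)
```

Invoice 1002 lists TEA twice, so its quantities are summed to 4, and it has no MUG line, so that cell is filled with 0.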

In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Now, the tricky part is figuring out what this tells us. For instance, there are quite a few rules with a high lift value, which means that the items in the rule occur together more often than would be expected if they were purchased independently. Several rules also have high confidence. This part of the analysis is where domain knowledge comes in handy. Since I do not have that, I’ll just look for a couple of illustrative examples.
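
To make support, confidence, and lift concrete, the sketch below computes them by hand for a single rule, "tea -> mug", over five made-up transactions. It uses the standard definitions of the three measures and does not call mlxtend.

```python
# Toy transactions, made up for illustration.
transactions = [
    {'tea', 'mug'},
    {'tea', 'mug', 'spoon'},
    {'tea'},
    {'mug', 'spoon'},
    {'tea', 'mug'},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

antecedent, consequent = {'tea'}, {'mug'}

# Support of the rule: how often tea and mug appear together.
supp_rule = support(antecedent | consequent)
# Confidence: among transactions with tea, the share that also contain mug.
confidence = supp_rule / support(antecedent)
# Lift: confidence relative to how common mug is overall.
lift = confidence / support(consequent)

print("support=%.2f confidence=%.2f lift=%.4f" % (supp_rule, confidence, lift))
```

In this toy data the lift comes out slightly below 1, meaning tea and mug appear together a little less often than independence would predict. A lift above 1 indicates the opposite.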

We can filter the dataframe using standard pandas code. In this case, look for rules with a high confidence (at least 0.9):

In [None]:
rules[ rules['confidence'] >= 0.9]

From the results, we can observe which items are most likely to be purchased together.