All three tutorials are Jupyter notebooks, so you need to install Jupyter Notebook before you can run the code. Please make sure you have a recent version of Jupyter Notebook so that it is compatible with our code.
<br>

Follow the instructions at https://jupyter.org/install to install Jupyter Notebook. If you are a Windows user, you can also follow the instructions at https://www.geeksforgeeks.org/how-to-install-jupyter-notebook-in-windows/ (using Anaconda is recommended).


Also download the Mumsnet data https://osf.io/974fc/?view_only=a1b5afe488db4014b3f21ed808bcceb9 and the Reddit data https://osf.io/qp6cs/?view_only=a1b5afe488db4014b3f21ed808bcceb9 and copy both into the ./data folder.


Now that you have Jupyter Notebook installed, open this file and run all the code blocks. The goal of this part is to familiarize you with some basic Python syntax. Please note that we will not teach Python in our live tutorials; we assume that everyone has a basic knowledge of Python.

Here we go through some of the most common Python instructions that we will use in our live tutorials. It is worth mentioning that you are not expected to remember the exact syntax of the code; the objective of this tutorial is just to familiarize you with the most used commands in our tutorials.


###### What is a Python library?

A Python library is a collection of modules that lets you perform many tasks without writing the code yourself. For example, when you are working with data, numpy, sklearn, and pandas are among the most commonly used libraries; they provide functions for reading data from files, performing various operations on data, and so on.
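
As a small illustration of that idea, the sketch below computes the mean of a list twice: once by hand with a loop, and once with a single numpy call. The list `values` is made up for this example.

```python
import numpy as np

values = [2, 4, 6, 8]  # made-up numbers, only for illustration

# Without a library: add up the values and divide by how many there are
total = 0
for v in values:
    total += v
manual_mean = total / len(values)

# With numpy: one function call does the same job
numpy_mean = np.mean(values)

print(manual_mean, numpy_mean)
```

Both variables hold the same value; the library version is simply shorter and less error-prone.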


###### Pandas
Before using a library, we have to 'import' it into our Python program. For example, here we import the 'pandas' library, which is used for data manipulation and analysis.

In [None]:
import pandas as pd

Here, we use it to read a data file by passing the path of the file. The result is called a dataframe:

In [None]:
df = pd.read_csv('./preprocessing/sample_raw_data/sample.csv')

Let's take a look at the dataframe df by showing its first 10 rows:

In [None]:
df.head(10)

Print the shape of the dataframe (number of rows and columns) and the list of its features (column names):

In [None]:
print(df.shape)
print(list(df))


Select a specific column of our dataframe:

In [None]:
df['body']

<br>
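
If the sample CSV file is not at hand, the same pandas commands can be tried on a tiny dataframe built in memory. This is only a sketch: the column names `author` and `body` and their values are hypothetical, and the real sample file may contain different columns.

```python
import pandas as pd

# Hypothetical toy data; the real sample.csv may have different columns
toy_df = pd.DataFrame({
    'author': ['user_a', 'user_b', 'user_c'],
    'body': ['first post', 'second post', 'third post'],
})

print(toy_df.shape)             # (number of rows, number of columns)
print(list(toy_df))             # the column names
print(toy_df['body'].tolist())  # one column, converted to a plain Python list
```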

###### some basic operations

Here we go through some of the basic Python operations, such as loops and if-else statements.



<br>
The for loop syntax is used to iterate over a sequence (list, tuple, string) or other iterable objects.

In [None]:
slist = [1, 4, 7, 3, 2]
for item in slist:
    print(item)

An if-else statement evaluates whether an expression is true or false. If the condition is true, the “if” block executes; otherwise, the “else” block executes:

In [None]:
for item in slist:
    if item > 3:
        print(item)
    else:
        print('Failed')

<br>
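
For the list above, the loop prints 'Failed' for 1, 3 and 2, and prints the item itself for 4 and 7. A common variation, sketched below, is to collect the matching items in a new list instead of printing them; the name `large_items` is chosen only for this example.

```python
slist = [1, 4, 7, 3, 2]

# Collect every item greater than 3 in a new list
large_items = []
for item in slist:
    if item > 3:
        large_items.append(item)

print(large_items)  # [4, 7]
```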

###### function
A function is a block of code that performs one or more actions. It only runs when it is called. You can pass data into a function through its parameters.

For example, here we define a function to print the 'body' of each row of our dataframe:

In [None]:
def printbody(df):
    for i, row in df.iterrows():
        print(row['body'])

In [None]:
printbody(df)

<br>
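
Functions can also hand a result back to the caller with `return`. The sketch below defines a hypothetical helper, `count_words`, that takes one parameter and returns a number instead of printing it.

```python
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

# Call the function and store the returned value
n = count_words('logistic regression is simple')
print(n)  # 4
```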

###### Logistic Regression
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. Mathematically, a logistic regression model predicts P(Y=1) as a function of X (Y is the dependent variable, which can be 0 or 1, and X is a set of independent variables). It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, and cancer detection.
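
To make "P(Y=1) as a function of X" concrete, here is a minimal sketch of the standard logistic (sigmoid) formula, P(Y=1) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2))). The coefficient values below are hypothetical and chosen only for illustration; in practice they are learned from the training data.

```python
import numpy as np

def predict_probability(x, weights, intercept):
    """Return P(Y=1 | x) under a logistic regression model."""
    z = intercept + np.dot(weights, x)   # linear combination of the inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid maps z into the range (0, 1)

# Hypothetical coefficients, not learned from any data
weights = np.array([1.5, -0.5])
intercept = 0.0

p_zero = predict_probability(np.array([0.0, 0.0]), weights, intercept)
p_pos = predict_probability(np.array([2.0, 1.0]), weights, intercept)
print(p_zero, p_pos)
```

When the linear combination is 0 the predicted probability is exactly 0.5; larger positive values push it toward 1 and larger negative values push it toward 0.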


Here, we provide an example of how to use logistic regression to model your data. The sample contains 120 data points, each of which has two independent variables x1 and x2 (the input variables) and a dependent variable y (the outcome variable). The goal is to use the data points to create a predictive model of the outcome variable.


<br>

First, read the data file:

In [None]:
df = pd.read_csv('./data/lgr_sample.csv')
df

As an optional step, we plot our data points to get a better understanding of our data.

In [None]:

import matplotlib.pyplot as plt
import numpy as np

# separating the datapoint based on their outcome variable

# data points with the outcome variable 0 (y=0) 
x_1 = df.loc[df.y==0][['x1', 'x2']].values
# data points with the outcome variable 1 (y=1) 
x_2 = df.loc[df.y==1][['x1', 'x2']].values



# plotting the data points with a different color for each label
plt.scatter(x_1[:, 0], x_1[:, 1], c='b', label='y = 0')
plt.scatter(x_2[:, 0], x_2[:, 1], c='r', label='y = 1')



plt.xlabel("x1",fontsize=15)
plt.ylabel("x2",fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend()
plt.show()

As you can see, the data points (red and blue) are fairly well separated from each other. The goal of logistic regression is to find a model that can separate them based on the independent variables.

<br>

Now, we want to train our logistic regression model. The first step is to split the dataset into a training set and a test set. We use the training set to learn the model, and we use the test set to evaluate the model and see how accurately it performs on unseen data.
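
The sketch below shows what such a split looks like on a synthetic stand-in for a 120-row dataset (the arrays `X_demo` and `y_demo` are made up and are not the tutorial data). With `test_size=0.33`, scikit-learn rounds the test size up, so 120 rows should give 40 test rows and 80 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 120-row dataset (values are arbitrary)
X_demo = np.arange(240).reshape(120, 2)
y_demo = np.array([0, 1] * 60)

# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on a row it was trained on.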


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2


X = df[['x1', 'x2']].to_numpy()
y = df['y'].to_numpy()



# splitting train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# training the logistic regression model
model = LogisticRegression(random_state=0).fit(X_train, y_train)

To get a better understanding of how logistic regression modeled our data, we plot the data points along with the decision boundary of our trained model (you don't need to understand what each command does in this block of code).

In [None]:
# Plot the decision boundary: predict a label for every point on a grid
# covering the data range, then color each grid region by its predicted label.

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

# Meshgrid creation (h is the step size of the grid)
h = 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict the label of each grid point, then reshape to the shape of the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plotting
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(x_1[:, 0], x_1[:, 1], c='b', label='y = 0')
plt.scatter(x_2[:, 0], x_2[:, 1], c='r', label='y = 1')

plt.xlabel("x1",fontsize=15)
plt.ylabel("x2",fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend()  # show the labels defined in the scatter calls
plt.show()

Predict the outcome variable (y) of the test data points:

In [None]:
y_pred = model.predict(X_test)
print('actual labels of the test samples:%s'%y_test)
print('predicted labels for the test samples:%s'%y_pred)

Calculate the accuracy and AUC of our model on the test set:

In [None]:
acc = accuracy_score(y_test, y_pred)

s = model.decision_function(X_test)
auc = roc_auc_score(y_test, s)

print('test accuracy:{}'.format(acc), 'test AUC :{}'.format(auc))

The accuracy is 90%, which means the model correctly predicts the outcome variable for about 9 out of 10 test samples.
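
To show what the two metrics measure, the sketch below recomputes them by hand on ten hypothetical test samples (the labels and scores are made up and are not the tutorial's results). Accuracy is the fraction of predicted labels that match the true labels. AUC works on the continuous scores, such as those returned by `decision_function`, and equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores for ten test samples
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([2.1, -1.3, 0.8, -0.2, -0.9, 1.7, -2.0, -0.1, 1.2, -0.5])

# A positive score means the model predicts class 1
y_hat = (scores > 0).astype(int)

# Accuracy: fraction of predictions that match the true labels
accuracy = np.mean(y_true == y_hat)

# AUC: fraction of (positive, negative) pairs ranked correctly
# (there are no tied scores in this example)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise_auc = np.mean(pos[:, None] > neg[None, :])

print(accuracy, pairwise_auc, roc_auc_score(y_true, scores))
```

The hand-computed AUC agrees with `roc_auc_score`, which is why the tutorial passes scores rather than predicted labels to that function.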

In [None]:
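# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(model, X_test, y_test)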
plt.show()