In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
import matplotlib.pyplot as plt

In this notebook we make the same predictions as before, but with a different algorithm. Previously we used `DecisionTreeClassifier`; this time we use `LogisticRegression`.

# Data Preparation

### Import

Let's define the `wrangle` function that we'll use for most of our data cleaning this time.

In [6]:
# wrangle receives the path to the dataset and the animal of interest,
# since we'll be working with only one animal
def wrangle(path, animal):
    df = pd.read_csv(path)

    # subset the dataset to the animal of interest
    df = df[df['Animal'] == animal]
    return df

In [7]:
df = wrangle('datasets/animal_disease_dataset.csv','cow')
df.head()

Unnamed: 0,Animal,Age,Temperature,Symptom 1,Symptom 2,Symptom 3,Disease
0,cow,3,103.1,depression,painless lumps,loss of appetite,pneumonia
3,cow,14,100.3,loss of appetite,swelling in limb,crackling sound,blackleg
11,cow,11,103.9,depression,painless lumps,loss of appetite,lumpy virus
19,cow,14,102.7,shortness of breath,sweats,chills,anthrax
20,cow,1,103.7,depression,loss of appetite,painless lumps,lumpy virus


### Data Exploration 

Let's do some exploration to gain more insight into this dataset.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11254 entries, 0 to 43776
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Animal       11254 non-null  object 
 1   Age          11254 non-null  int64  
 2   Temperature  11254 non-null  float64
 3   Symptom 1    11254 non-null  object 
 4   Symptom 2    11254 non-null  object 
 5   Symptom 3    11254 non-null  object 
 6   Disease      11254 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 703.4+ KB


### Data Split

Here I start by creating my feature matrix (`X`) and target vector (`y`).


In [11]:
X = df[df.columns[1:6]]
X.head()

Unnamed: 0,Age,Temperature,Symptom 1,Symptom 2,Symptom 3
0,3,103.1,depression,painless lumps,loss of appetite
3,14,100.3,loss of appetite,swelling in limb,crackling sound
11,11,103.9,depression,painless lumps,loss of appetite
19,14,102.7,shortness of breath,sweats,chills
20,1,103.7,depression,loss of appetite,painless lumps


In [12]:
y = df['Disease']
y.head()

0       pneumonia
3        blackleg
11    lumpy virus
19        anthrax
20    lumpy virus
Name: Disease, dtype: object

Now that I have my `X` and `y`, I am going to create the train and test sets.

In [13]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=.2, random_state=42)

Next, from `X_train` and `y_train`, I am going to create the validation set.

In [14]:
X_train,X_val,y_train,y_val = train_test_split(X_train,y_train, test_size=.2, random_state=42)
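
Applying `test_size=.2` twice leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing. Below is a rough sketch of that arithmetic. It assumes, as I understand scikit-learn's behavior, that a fractional `test_size` is rounded up to a whole number of rows. `split_sizes` is a hypothetical helper for illustration, not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    # Assumption: the held-out share is rounded up to a whole row
    n_test = math.ceil(test_size * n_rows)
    n_remaining = n_rows - n_test
    # The validation set is carved out of what is left after the test split
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# 11254 is the row count reported by df.info() for the cow subset
n_train, n_val, n_test = split_sizes(11254)
print(n_train, n_val, n_test)
```

Comparing these numbers with `len(X_train)`, `len(X_val)`, and `len(X_test)` is a quick sanity check that the splits came out as intended.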

### Model Building

Now that my data is ready, it's time to build my model. I will start with a baseline model.

#### Baseline Model
My baseline model always predicts the class that occurs most often in the training data.

In [15]:
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))

Baseline Accuracy: 0.22
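
To make the baseline concrete, here is a small sketch of the same idea in plain Python: count the labels, pick the most frequent one, and report the share of rows it covers. The labels are toy values for illustration only, and `majority_baseline_accuracy` is a hypothetical helper that mirrors what `y_train.value_counts(normalize=True).max()` computes.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Toy labels for illustration only, not taken from the dataset
toy_labels = ["anthrax", "blackleg", "anthrax", "pneumonia", "anthrax", "blackleg"]
label, acc = majority_baseline_accuracy(toy_labels)
print(label, round(acc, 2))
```

Any model worth keeping should score clearly above this number on held-out data.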


#### My Model

I will create a pipeline that combines the encoder and the classifier, instead of running each step separately.

In [24]:
# model building
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LogisticRegression(max_iter=1000),
)
# Fit model to training data
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
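
The warning above says the solver stopped at the `max_iter` limit and suggests scaling the data. One possible way to follow that advice is sketched below. This is not what the notebook does: it swaps in scikit-learn's own `OneHotEncoder` and a `ColumnTransformer` so that only the numeric columns are standardized, and it uses a handful of toy rows so the block runs on its own. The names `preprocess` and `scaled_model` are mine.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; in the notebook this would be X_train, y_train
toy_X = pd.DataFrame({
    "Age": [3, 14, 11, 14, 1, 6],
    "Temperature": [103.1, 100.3, 103.9, 102.7, 103.7, 100.9],
    "Symptom 1": ["depression", "loss of appetite", "depression",
                  "shortness of breath", "depression", "loss of appetite"],
})
toy_y = ["pneumonia", "blackleg", "lumpy virus", "anthrax", "lumpy virus", "blackleg"]

# Standardize the numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Temperature"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Symptom 1"]),
])

scaled_model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
scaled_model.fit(toy_X, toy_y)

toy_pred = scaled_model.predict(toy_X)
print(toy_pred)
```

Whether scaling removes the warning on the real data, or changes the accuracy, would need to be checked by refitting on `X_train` and `y_train`.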


#### Model Evaluation

I will evaluate my model's accuracy on the `train`, `test`, and `validation` data.

In [25]:
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
validation_acc = model.score(X_val,y_val)

print('Training accuracy',round(train_acc,2))
print('Testing accuracy',round(test_acc,2))
print('Validation accuracy',round(validation_acc,2))

Training accuracy 0.83
Testing accuracy 0.82
Validation accuracy 0.82
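
A single accuracy number hides which diseases the model confuses with each other. The sketch below shows how `accuracy_score` and `confusion_matrix` from `sklearn.metrics` could be used to look closer. The labels and predictions are toy values for illustration only; in the notebook, `y_true` would be `y_val` and `y_pred` would be `model.predict(X_val)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels and predictions for illustration only
labels = ["anthrax", "blackleg", "pneumonia"]
y_true = ["anthrax", "anthrax", "blackleg", "blackleg", "pneumonia", "pneumonia"]
y_pred = ["anthrax", "blackleg", "blackleg", "blackleg", "pneumonia", "anthrax"]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(round(acc, 2))
print(cm)
```

Off-diagonal entries in the matrix point at the class pairs that are worth a closer look.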


##### Testing My Model With Sample Data

In [26]:
# build a one-row DataFrame whose columns match the training features
def transf(Age,Temperature,Symptom1,Symptom2,Symptom3):
    row = pd.DataFrame([{
        'Age':Age,
        'Temperature':Temperature,
        'Symptom 1':Symptom1,
        'Symptom 2':Symptom2,
        'Symptom 3':Symptom3}])
    return row

result = model.predict(transf(3,101.4,'shortness of breath','sweats','chills'))[0]
print('Predicted disease:',result)

Predicted disease: anthrax
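
`predict` returns only the top class. Logistic regression also exposes `predict_proba`, which shows how confident the model is in each class. Below is a minimal sketch on toy numeric data, for illustration only. On the notebook's pipeline the equivalent call would be `model.predict_proba(transf(...))`, with the class order given by `model.classes_`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy numeric rows for illustration only, not the real training data
toy_X = pd.DataFrame({
    "Age": [3, 14, 11, 14, 1, 6],
    "Temperature": [103.1, 100.3, 103.9, 102.7, 103.7, 100.9],
})
toy_y = ["pneumonia", "blackleg", "lumpy virus", "anthrax", "lumpy virus", "blackleg"]

clf = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# One new observation, shaped like the training features
sample = pd.DataFrame([{"Age": 3, "Temperature": 101.4}])
proba = clf.predict_proba(sample)[0]

# Pair each class with its probability and sort from most to least likely
ranked = sorted(zip(clf.classes_, proba), key=lambda pair: pair[1], reverse=True)
for disease, p in ranked:
    print(disease, round(p, 3))
```

A prediction whose top probability is only slightly ahead of the runner-up deserves less trust than one with a wide margin.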
