<h1>Project 10</h1>

<h2>Agriculture Dataset</h2>

<img src="https://cdn.pixabay.com/photo/2016/03/29/08/48/project-1287781_1280.jpg">

***First, import the necessary packages***

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set()
import warnings
warnings.filterwarnings("ignore")

***Import the data, which is in xlsx format***

In [None]:
train=pd.read_excel('train_agriculture.xlsx')
train.head()

In [None]:
test=pd.read_excel('test_agriculture.xlsx')
test.head()

The train and test data are provided separately, so let's keep them that way: we will run the EDA only on the train data and apply the final model to the test data.

Before that, we will handle the NaN values in both datasets.
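
A quick way to see where the NaN values sit is to count them per column. This is a minimal sketch on a made-up toy frame (the values below are assumptions, not taken from the real files); the same two calls can be run on `train` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "Estimated_Insects_Count": [188, 209, 257, 342],
    "Number_Weeks_Used": [0.0, np.nan, 12.0, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns with a large missing share are candidates for dropping or imputing.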

<h2>Start of EDA process</h2>

***Checking shape of the Dataset***

In [None]:
print('Shape of training set is',train.shape)
print('Shape of testing set is',test.shape)

The testing set is missing the Crop_Damage column, which is our target variable and what we are going to predict in this project.

In [None]:
train.info()

In [None]:
test.info()

Number_Weeks_Used has many missing values in both the train and test data, so we will drop this column because it would affect our predictions. We will also drop ID, as it is only a row identifier and carries no other information.

In [None]:
df=train[['Estimated_Insects_Count','Crop_Type','Soil_Type','Pesticide_Use_Category','Number_Doses_Week','Number_Weeks_Quit','Season','Crop_Damage']]
test=test[['Estimated_Insects_Count','Crop_Type','Soil_Type','Pesticide_Use_Category','Number_Doses_Week','Number_Weeks_Quit','Season']]
df.head()
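
An alternative to listing every column to keep is to name only the columns to drop. The sketch below uses a made-up toy frame (row values are assumptions); `DataFrame.drop(columns=...)` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "ID": ["row_1", "row_2"],
    "Estimated_Insects_Count": [188, 209],
    "Number_Weeks_Used": [0.0, None],
    "Crop_Damage": [0, 1],
})

# Drop the identifier and the column with many missing values
trimmed = toy.drop(columns=["ID", "Number_Weeks_Used"])

print(trimmed.columns.tolist())
```

This form is shorter when only a few columns are removed, and it does not need updating if new columns are added later.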

We now have a complete dataset on which we can do our analysis

***Let's visualize our data***

In [None]:
plt.hist(df['Estimated_Insects_Count'],color='r')

In [None]:
plt.hist(df['Number_Weeks_Quit'],color='g')

In [None]:
col=['Crop_Type','Soil_Type','Pesticide_Use_Category','Number_Doses_Week','Season','Crop_Damage']

# Draw one count plot per column; pass the data as x= so seaborn treats it as the category axis
for c in col:
    sns.countplot(x=df[c])
    plt.show()

***Let's check for outliers***

*First we will check each attribute for outliers visually; a box plot is the best option for this*

In [None]:
col=['Estimated_Insects_Count','Crop_Type','Soil_Type','Pesticide_Use_Category','Number_Doses_Week','Number_Weeks_Quit','Season','Crop_Damage']
# Draw one box plot per column; x= makes the plot horizontal regardless of seaborn version
for c in col:
    sns.boxplot(x=df[c],color='r')
    plt.show()

Most of our features are categorical, and the rest increase continuously over a period of time, so we are not going to handle outliers here, as it would not be helpful
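
The points a box plot draws beyond its whiskers follow the usual 1.5 × IQR rule, so the visual check can be backed by a count. This is a minimal sketch on an invented series, not on the real columns.

```python
import pandas as pd

# Invented values with one obvious extreme point
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the standard box plot convention
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values falling outside the whiskers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```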

***Let's calculate the correlation***

In [None]:
df.corr()

***Let's visualize it using a heat map as well***

In [None]:
corr = df.corr()
# Boolean mask that hides the upper triangle, since the matrix is symmetric
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(12, 12))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True,annot=True, linewidths=0.5,cmap="YlGnBu")

***Let's separate our input and output variables***

In [None]:
x=df[['Estimated_Insects_Count','Crop_Type','Soil_Type','Pesticide_Use_Category','Number_Doses_Week','Number_Weeks_Quit','Season']]
x.head()

In [None]:
# Select the target as a 1-D Series, the shape scikit-learn expects for y
y=df['Crop_Damage']
y.head()

Both the input and output attributes are now cleaned and in the desired format

<h2>End of EDA Process</h2>

Let's start building models to make predictions and find the one that works best on our dataset

<h2>Start of Machine Learning Process</h2>

Since our target variable is categorical with multiple classes, we are going to do classification analysis

***Let's import the required packages***
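
Before choosing a metric, it can help to look at how the target classes are distributed, because accuracy is harder to interpret when one class dominates. The sketch below uses an invented target series; on the real data the same calls would be applied to `df['Crop_Damage']`.

```python
import pandas as pd

# Invented labels, only to illustrate the calls
target = pd.Series([0, 0, 0, 1, 0, 2, 0, 1], name="Crop_Damage")

# Absolute number of rows per class
class_counts = target.value_counts()

# Share of rows per class
class_share = target.value_counts(normalize=True)

print(class_counts)
print(class_share)
```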

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)

In [None]:
KNN=KNeighborsClassifier(n_neighbors=6)
SV=SVC()
LR=LogisticRegression()
MNB=MultinomialNB()
DT=DecisionTreeClassifier(random_state=6)

In [None]:
models = []
models.append(('KNeighborsClassifier', KNN))
models.append(('SVC', SV))
models.append(('LogisticRegression', LR))
models.append(('DecisionTreeClassifier', DT))
models.append(('MultinomialNB',MNB))

***Let's create a loop that will fit and evaluate all our models***

In [None]:
Model = []
score = []
cvs=[]
for name,model in models:
    print('*-----------------------------*',name,'*------------------------------*')
    print('\n')
    Model.append(name)
    model.fit(x_train,y_train)
    print(model)
    pre=model.predict(x_test)
    print('\n')
    AS=accuracy_score(y_test,pre)
    print('Accuracy_score = ',AS*100)
    score.append(AS*100)
    print('\n')
    sc = cross_val_score(model, x, y, cv=10, scoring='accuracy').mean()
    print('Cross_Val_Score = ',sc*100)
    cvs.append(sc*100)
    print('\n')
    print('classification_report\n',classification_report(y_test,pre))
    print('\n\n')

In [None]:
result = pd.DataFrame({'Classification Model': Model, 'Accuracy score': score ,'Cross Validation Score':cvs})
result
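
Distance-based models such as SVC and KNeighborsClassifier are sensitive to feature scale, so one option worth comparing is a pipeline that standardizes the features inside each cross-validation fold. This is a hedged sketch on synthetic data from `make_classification`, not the notebook's method; `scaled_svc` is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data with 7 features, standing in for x and y
X_toy, y_toy = make_classification(
    n_samples=200, n_features=7, n_informative=4,
    n_classes=3, random_state=0,
)

# Scaling is fitted on the training part of each fold only,
# which avoids leaking information from the held-out part
scaled_svc = make_pipeline(StandardScaler(), SVC())

fold_scores = cross_val_score(scaled_svc, X_toy, y_toy, cv=5, scoring="accuracy")
print(fold_scores.mean())
```

Whether scaling helps on the agriculture data would need to be checked by adding the pipeline to the `models` list.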

**We are going to choose SVC as our final model, as it gives the highest accuracy with a good cross-validation score**

<h2>End of Machine learning Process</h2>


<h1>Now let's save our final model</h1>

In [None]:
import joblib
# Save the fitted estimator instance (SV), not the SVC class itself
joblib.dump(SV,'Agriculture.pkl')
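
A saved model is only useful if it is the fitted estimator and can be loaded back. The round trip can be checked as in this minimal sketch, which uses a tiny invented dataset and a temporary file (`toy_model.pkl` is a hypothetical name).

```python
import os
import tempfile

import joblib
from sklearn.svm import SVC

# Tiny invented dataset, only to obtain a fitted model
X_toy = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y_toy = [0, 1, 0, 0, 1, 1]
fitted = SVC().fit(X_toy, y_toy)

# Dump to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = bool((restored.predict(X_toy) == fitted.predict(X_toy)).all())
print(same)
```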

<h2>Prediction on Test set</h2>

In [None]:
test.head()

In [None]:
Test_result=SV.predict(test)
Test_result

***Let's convert our result to a DataFrame and save it in CSV format***

In [None]:
Result=pd.DataFrame(Test_result,columns =["Crop_Damage"])
Result.head()

In [None]:
Result.to_csv('Crop_Damage_Predict.csv') 
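
By default `to_csv` also writes the row index as an unnamed first column. If only the predictions are wanted in the file, `index=False` leaves it out. The sketch below writes invented predictions to an in-memory buffer so the output can be inspected without creating a file.

```python
import io

import pandas as pd

# Invented predictions, standing in for Test_result
preds = pd.DataFrame({"Crop_Damage": [0, 1, 0]})

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
preds.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```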

<img src="https://knowledge.wharton.upenn.edu/wp-content/uploads/2020/05/Women-in-data-science.jpg">