##### Loading the dataset

In [1]:
import pandas as pd
data=pd.read_csv('dataset1.csv')

##### Dropping columns
Dropping ids as it is a not a good data member for the model to learn (no practical significance)
Dropping categories as it is not mentioned to add it furthermore it is categorical and it would increase complexity

In [2]:
data=data.drop(['id'], axis=1)
data=data.drop(['categories'], axis=1)

##### Duplacating the data 
Will be used in further processes

In [3]:
databackup=data.copy()

##### Shape of data

In [4]:
data.shape

(1000, 7)

##### Checking NaN's in the dataset

In [5]:
data.isnull().sum()

water               42
uv                  51
area                 0
fertilizer_usage     0
yield                0
pesticides           0
region               0
dtype: int64

So all other columns have no Nan's that is great for the model. Need to cater Nan's in water and uv column

##### Analyzing the water column

In [6]:
data['water'].describe()

count     958.000000
mean       12.223546
std       172.335566
min         0.072000
25%         4.584750
50%         6.476000
75%         8.758750
max      5340.000000
Name: water, dtype: float64

mean is nearly 12 and most of the data almost 90% lies below this range so it would be suitable to go with mean!

##### Filling water with mean

In [7]:
data['water'].fillna(data['water'].mean(),inplace=True)

##### Analyzing the UV column

In [8]:
data['uv'].describe()

count    949.000000
mean      73.957488
std        9.904063
min       45.264000
25%       66.502000
50%       73.689000
75%       80.554000
max      106.310000
Name: uv, dtype: float64

mean is nearly 70 and most of the data almost 50% lies near this mean. So best case to fill with mean.

##### Filling uv with mean

In [9]:
data['uv'].fillna(data['uv'].mean(),inplace=True)

## Choosing yield column as the target variable
Reason is that given all the parameters like rainfall, pesticides and etc, best thing is to calculate yield!

##### Checking linear relations of others with yeild column

In [10]:
import matplotlib.pyplot as plt
for x in ['water', 'uv', 'area', 'fertilizer_usage', 'pesticides','region']:
    plt.plot(data[x], data["yield"])
    plt.xlabel(x)
    plt.ylabel("yield")
    plt.title("linear correspndance of "+x+" with yield")
    plt.show()

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

SO all the columns have shown good realtion with the yield column i.e. they can are seperable except the water column which is not so diffrentiable as the data has a sudden change and and same before and after

##### Training the linear model and checking the results

In [11]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
#specifying the x and y
X = data[['water', 'uv', 'area', 'fertilizer_usage', 'pesticides','region']] 
y = data['yield']
#test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#linear regression model
regr = linear_model.LinearRegression()
#fitting on the training set
regr.fit(X_train, y_train)
#predicting on the test set
y_pred=regr.predict(X_test)
#intercept of predicted line
print('Intercept: \n', regr.intercept_)
#coefficients for columns
print('Coefficients: \n', regr.coef_)

Intercept: 
 -0.8256631148800011
Coefficients: 
 [-6.31615068e-04 -5.30639744e-02  6.68792314e+00  1.00585507e+01
  4.19085050e-01 -4.40550871e+00]


##### Checking the error

In [12]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

10.911573768053257

##### lets check whether the error is acceptable or not!

In [13]:
data['yield'].describe()

count    1000.000000
mean       58.758571
std        24.563683
min         2.843000
25%        40.698000
50%        55.602500
75%        73.645500
max       148.845000
Name: yield, dtype: float64

SO most of the data is on the upper side whereas the answer is comparitively small so this is a good and an acceptable model

### lets check whether dropping the nan rows could have done better

##### dropping the NAN rows

In [14]:
datanonnan=databackup.dropna()

##### Again training the linear model and checking the results

In [15]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
#specifying the x and y
X = data[['water', 'uv', 'area', 'fertilizer_usage', 'pesticides','region']] 
y = data['yield']
#test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#linear regression model
regr = linear_model.LinearRegression()
#fitting on the training set
regr.fit(X_train, y_train)
#predicting on the test set
y_pred=regr.predict(X_test)
#intercept of predicted line
print('Intercept: \n', regr.intercept_)
#coefficients for columns
print('Coefficients: \n', regr.coef_)

Intercept: 
 -0.8256631148800011
Coefficients: 
 [-6.31615068e-04 -5.30639744e-02  6.68792314e+00  1.00585507e+01
  4.19085050e-01 -4.40550871e+00]


##### Checking the error

In [16]:
mean_absolute_error(y_test, y_pred)

10.911573768053257

##### Error way to big, even worse than the old one so discarded

# ANALYSIS

#### Which region had the best yield?

In [30]:
info=data[['region','yield']]
info.groupby(['region']).sum()

Unnamed: 0_level_0,yield
region,Unnamed: 1_level_1
0,3372.008
1,8799.123
2,17511.503
3,7582.605
4,8139.08
5,2767.133
6,10587.119


#### So region 2 is the biggest producer!

#### Which region is most effected by pesticides?

In [32]:
info=data[['region','pesticides']]
info.groupby(['region']).sum()

Unnamed: 0_level_0,pesticides
region,Unnamed: 1_level_1
0,194.037
1,512.112
2,1104.663
3,397.41
4,353.712
5,221.776
6,668.591


#### So surprisingly the highest producing region has the most pesticides This information can be used by the farming department and pesticide control spray companies to look as a target market

#### Area of the regions

In [34]:
info=data[['region','area']]
info.groupby(['region']).sum()

Unnamed: 0_level_0,area
region,Unnamed: 1_level_1
0,422.954
1,1061.663
2,2214.744
3,946.155
4,1019.982
5,524.883
6,1908.467


#### So region 2 has the biggest areas and region 6 is also not far behind