# Agip Oil Company Limited (AOCL)

### Company Introduction: 
Your client for this project is an oil company.

They are one of the largest oil companies. By 2017 they had a daily production of 25.451 million BOE (barrels of oil equivalent).
In 2017, this was approximately 13% of world production, which is less than several of the largest state-owned petroleum companies.
They want to increase their production in order to compete with larger oil companies.
They want to automate the process of tracking the state and trend of production using previous crude oil data.
Their research and development teams are trying to understand the properties of previous years' crude oil data so that they can use it to increase production.

### Current Scenario
The task is to classify the state of production as growth or decay, based on historical crude oil data.
The traditional methods for predicting the crude oil production state are based on statistical analysis: they extract information from historical data and make reasoned judgements for classification.
Statistics-based models classify from a sample of the data, so their results may not be accurate.
However, designing a computer program to do this turns out to be a bit trickier.

## Problem Statement : 

The current process suffers from the following problems:

The current process is a manual classification of production state using statistical methods.
This is very tedious and time-consuming as it needs to be repeated for every new customer.

The company has hired you as a data science consultant.

They want to automate the process of predicting the production state using properties of the crude oil trend rather than doing this manual work.

Your Role:
You are given the previous years' crude oil production data.
Your task is to build a classification model using the dataset.
Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.


# Dataset Feature Description: 

The following properties of the crude oil data are included in the dataset:

<table>	<tr>	<th>	Column Name	</th>	<th>	Description	</th>	</tr>
<tr>	<td>	ID	</td>	<td>	Unique ID	</td>	</tr>
<tr>	<td>	month	</td>	<td>	Selected Months	</td>	</tr>
<tr>	<td>	country	</td>	<td>	Countries (76 in Total)	</td>	</tr>
<tr>	<td>	1_diffClosing stocks(kmt)	</td>	<td>	Closing Stocks for one Month	</td>	</tr>
<tr>	<td>	1_diffExports(kmt)	</td>	<td>	Exports for one Month	</td>	</tr>
<tr>	<td>	1_diffImports(kmt)	</td>	<td>	Imports for one Month	</td>	</tr>
<tr>	<td>	1_diffRefinery intake(kmt)	</td>	<td>	Refinery Intake for one Month	</td>	</tr>
<tr>	<td>	1_diffWTI	</td>	<td>	West Texas Intermediate Price for one Month	</td>	</tr>
<tr>	<td>	1_diffSumClosing stocks(kmt)	</td>	<td>	Sum Closing Stocks for one Month	</td>	</tr>
<tr>	<td>	1_diffSumExports(kmt)	</td>	<td>	Sum Exports for one Month	</td>	</tr>
<tr>	<td>	1_diffSumImports(kmt)	</td>	<td>	Sum Import for one Month	</td>	</tr>
<tr>	<td>	1_diffSumProduction(kmt)	</td>	<td>	Sum Production one Month	</td>	</tr>
<tr>	<td>	1_diffSumRefinery intake(kmt)	</td>	<td>	Sum Refinery Intake	</td>	</tr>
<tr>	<td>	Target	</td>	<td>	The label for the crude oil data (whether production will grow or not)	</td>	</tr>
</table>							
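The column names in the loaded data appear to follow a `<k>_diff<measure>` pattern, with `k` running from 1 to 12. As a minimal sketch, assuming that prefix is an index over months (the table above does not say so explicitly), the columns can be grouped by measure. The helper `split_lagged_column` and the short `columns` list are illustrative and not part of the original notebook.

```python
import re
from collections import defaultdict

# Hypothetical helper: split "<k>_diff<measure>" into (k, measure).
LAGGED = re.compile(r"^(\d+)_diff(.+)$")

def split_lagged_column(name):
    match = LAGGED.match(name)
    if match is None:
        return None  # e.g. "ID", "month", "country", "Target"
    return int(match.group(1)), match.group(2)

# A small subset of the real column names, for illustration only.
columns = ["ID", "month", "country", "1_diffWTI", "2_diffWTI",
           "1_diffSumProduction(kmt)", "12_diffSumProduction(kmt)", "Target"]

by_measure = defaultdict(list)
for col in columns:
    parsed = split_lagged_column(col)
    if parsed is not None:
        k, measure = parsed
        by_measure[measure].append(k)

print(dict(by_measure))
```

Grouping like this makes it easier to inspect one measure across all twelve prefixes instead of scanning 120 separate columns.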


In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings 
warnings.filterwarnings('ignore')
from pandas_profiling import profile_report

from sklearn.metrics import confusion_matrix,f1_score,recall_score,precision_score,accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score,GridSearchCV,train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier,RandomForestClassifier,VotingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


In [31]:
train_data = pd.read_csv("Oil_train.csv")
train_data.head()

Unnamed: 0,ID,month,country,1_diffClosing stocks(kmt),1_diffExports(kmt),1_diffImports(kmt),1_diffRefinery intake(kmt),1_diffWTI,1_diffSumClosing stocks(kmt),1_diffSumExports(kmt),1_diffSumImports(kmt),1_diffSumProduction(kmt),1_diffSumRefinery intake(kmt),2_diffClosing stocks(kmt),2_diffExports(kmt),2_diffImports(kmt),2_diffRefinery intake(kmt),2_diffWTI,2_diffSumClosing stocks(kmt),2_diffSumExports(kmt),2_diffSumImports(kmt),2_diffSumProduction(kmt),2_diffSumRefinery intake(kmt),3_diffClosing stocks(kmt),3_diffExports(kmt),3_diffImports(kmt),3_diffRefinery intake(kmt),3_diffWTI,3_diffSumClosing stocks(kmt),3_diffSumExports(kmt),3_diffSumImports(kmt),3_diffSumProduction(kmt),3_diffSumRefinery intake(kmt),4_diffClosing stocks(kmt),4_diffExports(kmt),4_diffImports(kmt),4_diffRefinery intake(kmt),4_diffWTI,4_diffSumClosing stocks(kmt),4_diffSumExports(kmt),4_diffSumImports(kmt),4_diffSumProduction(kmt),4_diffSumRefinery intake(kmt),5_diffClosing stocks(kmt),5_diffExports(kmt),5_diffImports(kmt),5_diffRefinery intake(kmt),5_diffWTI,5_diffSumClosing stocks(kmt),5_diffSumExports(kmt),5_diffSumImports(kmt),5_diffSumProduction(kmt),5_diffSumRefinery intake(kmt),6_diffClosing stocks(kmt),6_diffExports(kmt),6_diffImports(kmt),6_diffRefinery intake(kmt),6_diffWTI,6_diffSumClosing stocks(kmt),6_diffSumExports(kmt),6_diffSumImports(kmt),6_diffSumProduction(kmt),6_diffSumRefinery intake(kmt),7_diffClosing stocks(kmt),7_diffExports(kmt),7_diffImports(kmt),7_diffRefinery intake(kmt),7_diffWTI,7_diffSumClosing stocks(kmt),7_diffSumExports(kmt),7_diffSumImports(kmt),7_diffSumProduction(kmt),7_diffSumRefinery intake(kmt),8_diffClosing stocks(kmt),8_diffExports(kmt),8_diffImports(kmt),8_diffRefinery intake(kmt),8_diffWTI,8_diffSumClosing stocks(kmt),8_diffSumExports(kmt),8_diffSumImports(kmt),8_diffSumProduction(kmt),8_diffSumRefinery intake(kmt),9_diffClosing stocks(kmt),9_diffExports(kmt),9_diffImports(kmt),9_diffRefinery intake(kmt),9_diffWTI,9_diffSumClosing 
stocks(kmt),9_diffSumExports(kmt),9_diffSumImports(kmt),9_diffSumProduction(kmt),9_diffSumRefinery intake(kmt),10_diffClosing stocks(kmt),10_diffExports(kmt),10_diffImports(kmt),10_diffRefinery intake(kmt),10_diffWTI,10_diffSumClosing stocks(kmt),10_diffSumExports(kmt),10_diffSumImports(kmt),10_diffSumProduction(kmt),10_diffSumRefinery intake(kmt),11_diffClosing stocks(kmt),11_diffExports(kmt),11_diffImports(kmt),11_diffRefinery intake(kmt),11_diffWTI,11_diffSumClosing stocks(kmt),11_diffSumExports(kmt),11_diffSumImports(kmt),11_diffSumProduction(kmt),11_diffSumRefinery intake(kmt),12_diffClosing stocks(kmt),12_diffExports(kmt),12_diffImports(kmt),12_diffRefinery intake(kmt),12_diffWTI,12_diffSumClosing stocks(kmt),12_diffSumExports(kmt),12_diffSumImports(kmt),12_diffSumProduction(kmt),12_diffSumRefinery intake(kmt),Target
0,ID04188,6,46,0.0,0.0,0.0,0.0,2.62,7069.831,-3216.1655,3291.5712,-9977.5602,-5846.3237,0.0,0.0,0.0,0.0,-12.07,-2652.6804,2165.7119,-4491.3056,8211.1276,5942.2535,0.0,0.0,0.0,0.0,1.59,-1421.2644,-6194.2674,7706.7709,-10581.6649,2893.4431,0.0,0.0,0.0,0.0,3.26,1116.377,2926.9305,3217.1662,10352.5908,8773.1126,0.0,0.0,0.0,0.0,-6.92,2099.9996,-2666.4446,-3956.1835,-162.943,468.7699,0.0,0.0,0.0,0.0,8.02,-4942.9161,762.3373,-6657.2639,-7457.7408,-10467.55,0.0,0.0,0.0,0.0,1.5,4572.1869,219.2813,-3695.8343,10071.9849,-5463.0963,0.0,0.0,0.0,0.0,2.67,-5302.6713,3348.4325,3339.9561,-8501.1013,4832.3461,0.0,0.0,0.0,0.0,7.26,-2856.9324,8578.2891,5797.0895,11567.3423,11558.1583,0.0,0.0,0.0,0.0,-0.39,-1368.6205,-7199.7508,992.1046,1026.5045,-37.1761,0.0,0.0,0.0,0.0,6.11,-1109.4247,-8571.0166,-21553.7547,-27304.6951,-34199.31,0.0,0.0,0.0,0.0,9.09,9341.6818,7186.3975,22441.0501,21649.3543,27000.5446,0
1,ID02229,8,73,-117.0,0.0,-125.0,6.0,-7.4,-4477.9738,-675.9654,-3155.1122,-6110.6867,-3398.2043,10.0,0.0,20.0,14.0,0.38,-4169.4805,3348.5802,5221.9424,8194.1182,8280.4457,-14.0,0.0,-25.0,-85.0,-6.87,-5568.1857,1897.2103,523.2691,2815.5557,2466.8282,63.0,0.0,-15.0,-88.0,-9.88,410.6264,-11801.9158,-6149.1021,-11505.1199,-16479.2948,-68.0,0.0,-46.0,84.0,14.26,-347.1115,14913.5066,1755.1783,11917.8236,3333.8149,-49.0,0.0,19.0,18.0,7.17,1642.812,-6964.0468,2112.0208,-4765.5951,2166.4856,-33.0,0.0,-23.0,-21.0,-1.53,-3871.7808,8955.0574,1210.538,7768.6672,7369.5721,-13.0,0.0,-25.0,-35.0,-0.37,7872.1638,1455.8212,8591.652,4178.013,1082.3922,-1.0,0.0,47.0,-49.0,8.62,1157.7057,-4469.0027,-9726.7317,-16917.8483,-15146.8342,-32.0,0.0,-350.0,-313.0,-4.05,5542.7846,6826.3,3963.0511,20820.2155,17470.8133,0.0,0.0,-5.0,-23.0,1.86,4542.3605,-6912.5599,-3970.6252,-7432.5681,-14677.0249,-4.0,0.0,2.0,6.0,-18.37,1158.2971,5592.95,7813.3317,6655.9429,13708.031,0
2,ID01116,3,7,0.0,601.0,954.0,0.0,0.65,11955.6239,2183.3548,2914.9567,-258.3885,4412.7639,0.0,48.0,-44.0,0.0,2.92,673.8569,-13007.152,-5932.5482,-17130.8333,-13151.456,0.0,5.0,50.0,0.0,-0.33,6320.4728,19899.8242,10094.1744,16736.8301,10951.4688,0.0,-14.0,-22.0,0.0,1.56,2053.0322,-12407.3611,-6992.478,-10142.2844,345.4324,0.0,-17.0,19.0,0.0,2.59,952.4663,1703.9126,7072.1864,12766.0142,-416.3662,0.0,-99.0,-68.0,0.0,-2.98,-608.0429,-42.2514,-3416.1956,-2982.6795,-399.0076,0.0,-448.0,82.0,0.0,6.8,411.3844,12235.8057,3210.2984,10908.1097,8679.2593,0.0,539.0,4.0,0.0,-1.49,-795.922,-6970.3787,15436.3513,169.6097,3184.8428,0.0,-1.0,-553.0,0.0,7.33,-82.3617,4250.2224,-26272.2875,-8303.6222,-16628.0099,0.0,22.0,536.0,0.0,2.22,192.4645,2148.8449,4956.7036,9865.3804,6270.2905,0.0,-56.0,-17.0,0.0,-2.62,6434.5379,-1398.8519,1685.0951,-6704.0903,496.9148,0.0,73.0,33.0,0.0,-5.8,-5886.7599,6490.3996,21534.4067,2257.4031,14769.6373,1
3,ID06858,9,27,98.0,-69.0,499.0,865.0,3.04,-3143.2323,-566.6417,5975.3444,9230.8707,15595.8708,24.0,35.0,-77.0,80.0,8.39,-1105.6185,30.1372,-1947.5458,1164.3723,-1502.6444,176.0,20.0,140.0,-14.0,-4.29,192.8136,-7366.6492,-7502.3402,-10874.9151,-15372.2129,166.0,-52.0,320.0,397.0,-5.95,195.2005,7500.1033,6335.0098,12732.1449,9538.8196,153.0,-3.0,-94.0,-250.0,2.31,1655.5214,-3691.7079,-836.2419,-4159.0155,-4154.0051,-34.0,0.0,-552.0,-181.0,3.29,-5669.9954,6087.2918,-305.7071,8436.4626,12519.3416,487.0,34.0,563.0,-135.0,5.82,4256.604,-6163.9167,5479.716,-174.7942,-4042.4373,-423.0,-34.0,-1528.0,-639.0,-5.62,-403.5229,-9689.1309,-21307.0273,-27347.1868,-29017.4156,229.0,0.0,856.0,287.0,5.21,3810.5021,14021.786,14388.6486,26912.3218,22314.5198,428.0,0.0,-545.0,-827.0,-4.02,2846.7778,-3062.5934,-3528.501,-7132.5364,-8390.1861,185.0,0.0,652.0,938.0,-1.29,966.1555,4954.8497,5461.8036,10283.7521,11931.5244,-481.0,0.0,-680.0,104.0,4.43,-2504.4632,-11151.8836,-9154.3416,-10931.221,2850.7985,0
4,ID02754,6,3,16.7135,112.2191,1.4045,19.9439,12.16,830.1912,269.7015,-7511.8921,-11652.5443,-13372.9639,-169.663,-293.118,-2.2472,148.0337,13.65,-3284.4908,3898.9026,2763.9683,9378.1946,13542.7841,127.1068,15.5899,0.0,-225.2809,12.61,2786.9359,-4286.4373,-2623.4885,-7817.3203,-7381.6862,52.5281,133.7079,0.0,-14.1854,-15.79,473.7204,1000.7057,6570.3928,11479.8349,9437.7656,161.5168,141.4326,0.0,220.7865,-8.62,-2929.8903,1072.8473,504.0197,-4827.2299,3962.6288,-88.6236,-106.7416,0.0,-84.5506,-14.85,4636.3242,-12084.7079,-16072.0095,-10930.3444,-23788.694,103.5113,127.5281,0.0,-11.0955,-32.6,5108.1212,11540.4425,14204.4717,12949.1682,15777.6162,-10.1124,-19.5225,0.0,-324.2977,-12.89,371.6212,-9245.4723,-11594.1932,-14411.1174,-10187.1927,-172.8933,170.2247,0.0,260.5337,-10.61,-1972.9481,7132.4171,6530.5049,8097.2619,7605.0226,0.0,-446.6292,0.0,-99.4382,-2.87,5932.8693,-8217.7686,-1105.8695,-6166.6748,-6093.6192,-37.3595,133.7079,0.0,-265.0281,2.42,-1248.176,-12344.7158,-18690.9974,-26773.8205,-21027.8983,33.427,142.5562,0.0,290.0281,5.49,-24272.1488,10527.651,16947.2945,28844.215,22691.3429,0


In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7619 entries, 0 to 7618
Columns: 124 entries, ID to Target
dtypes: float64(120), int64(3), object(1)
memory usage: 7.2+ MB


In [4]:
train_data.isnull().sum()>0

ID                                False
month                             False
country                           False
1_diffClosing stocks(kmt)          True
1_diffExports(kmt)                False
                                  ...  
12_diffSumExports(kmt)            False
12_diffSumImports(kmt)            False
12_diffSumProduction(kmt)         False
12_diffSumRefinery intake(kmt)    False
Target                            False
Length: 124, dtype: bool

In [5]:
test_data = pd.read_csv('Oil_test.csv' )
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2540 entries, 0 to 2539
Columns: 123 entries, ID to 12_diffSumRefinery intake(kmt)
dtypes: float64(120), int64(2), object(1)
memory usage: 2.4+ MB


In [6]:
# Optional interactive exploration with D-Tale (left disabled):
# import dtale
# d = dtale.show(data=train_data)
# d.open_browser()

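### Checking for Constant Columns (Zero Standard Deviation)
# Exclude the non-numeric 'ID' column for this check only; train_data itself keeps 'ID',
# because it is dropped later when the feature matrix X is built.
numeric_data = train_data.drop('ID', axis=1)

# Collect columns whose standard deviation is zero (constant columns)
drop_cols = [col for col in numeric_data.columns if numeric_data[col].std() == 0]

print("Number of constant columns to be dropped: ", len(drop_cols))
print(drop_cols)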

In [7]:
train_data.drop_duplicates(keep='first',inplace=True)

In [8]:
train_data[train_data.columns[train_data.isnull().sum()>0]].isnull().sum()/len(train_data)*100

1_diffClosing stocks(kmt)     2.598766
1_diffImports(kmt)            1.496259
2_diffClosing stocks(kmt)     2.572516
2_diffImports(kmt)            1.456884
3_diffClosing stocks(kmt)     2.520016
3_diffImports(kmt)            1.430634
4_diffClosing stocks(kmt)     2.493766
4_diffImports(kmt)            1.404384
5_diffClosing stocks(kmt)     2.441265
5_diffImports(kmt)            1.378134
6_diffClosing stocks(kmt)     2.349390
6_diffImports(kmt)            1.338758
7_diffClosing stocks(kmt)     2.349390
7_diffImports(kmt)            1.299383
8_diffClosing stocks(kmt)     2.362515
8_diffImports(kmt)            1.246883
9_diffClosing stocks(kmt)     2.349390
9_diffImports(kmt)            1.220633
10_diffClosing stocks(kmt)    2.323140
10_diffImports(kmt)           1.194382
11_diffClosing stocks(kmt)    2.310014
11_diffImports(kmt)           1.207508
12_diffClosing stocks(kmt)    2.218139
12_diffImports(kmt)           1.168132
dtype: float64

In [9]:
Missingcolumns = train_data.columns[train_data.isnull().sum() > 0]

In [10]:
# Use an Index rather than a set: recent pandas versions reject a set as a column indexer
NotMissingcols = train_data.columns.difference(Missingcolumns)

In [11]:
train_data[NotMissingcols].isnull().sum().any()

False

In [12]:
train_data[Missingcolumns].isnull().sum().any()

True

In [13]:
pd.set_option('display.max_columns', 500)
train_data[Missingcolumns].describe()

Unnamed: 0,1_diffClosing stocks(kmt),1_diffImports(kmt),2_diffClosing stocks(kmt),2_diffImports(kmt),3_diffClosing stocks(kmt),3_diffImports(kmt),4_diffClosing stocks(kmt),4_diffImports(kmt),5_diffClosing stocks(kmt),5_diffImports(kmt),6_diffClosing stocks(kmt),6_diffImports(kmt),7_diffClosing stocks(kmt),7_diffImports(kmt),8_diffClosing stocks(kmt),8_diffImports(kmt),9_diffClosing stocks(kmt),9_diffImports(kmt),10_diffClosing stocks(kmt),10_diffImports(kmt),11_diffClosing stocks(kmt),11_diffImports(kmt),12_diffClosing stocks(kmt),12_diffImports(kmt)
count,7421.0,7505.0,7423.0,7508.0,7427.0,7510.0,7429.0,7512.0,7433.0,7514.0,7440.0,7517.0,7440.0,7520.0,7439.0,7524.0,7440.0,7526.0,7442.0,7528.0,7443.0,7527.0,7450.0,7530.0
mean,0.793335,1.806503,10.082052,9.765666,-2.008795,-4.416814,0.300337,2.604665,-4.458383,0.808337,-1.562454,6.095987,-7.354554,1.03492,2.673383,-1.137835,2.383598,5.232503,4.918354,-2.572943,-6.496416,3.151822,-0.39809,0.199018
std,673.345915,708.792852,625.337681,706.229767,853.887298,687.425761,782.088482,643.409212,854.701926,652.324294,850.928768,709.271945,765.531561,711.056755,765.688632,693.738493,846.480511,705.595079,629.462905,704.301309,849.078351,693.658597,847.618148,640.860631
min,-39400.0,-22159.0,-29600.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-12490.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-31600.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0
25%,-36.0,-48.0,-36.43625,-46.868025,-38.0,-50.529725,-37.9943,-48.02055,-38.0,-48.082175,-35.057875,-47.5342,-39.0,-52.0,-37.0,-48.2192,-36.0,-49.2459,-38.0,-51.0,-38.0,-52.0,-38.0,-52.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,39.0,49.0,40.0,52.007,39.0,49.0,39.2656,51.0,41.0,50.0,43.0,52.0,40.273975,50.0,43.0,53.0,43.0,52.663375,42.0,50.082675,40.4696,52.0959,45.0,52.9661
max,7525.7143,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,12277.0,31600.0,22021.0,31600.0,22021.0,7525.7143,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,12277.0


In [14]:
# Impute the missing values with the median, since these columns are not normally distributed (extreme min/max values)


In [15]:
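for cols in Missingcolumns:
    # Assign back instead of using inplace=True on a column selection,
    # which may not update train_data in recent pandas versions
    train_data[cols] = train_data[cols].fillna(train_data[cols].median())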

In [16]:
train_data[Missingcolumns].describe()

Unnamed: 0,1_diffClosing stocks(kmt),1_diffImports(kmt),2_diffClosing stocks(kmt),2_diffImports(kmt),3_diffClosing stocks(kmt),3_diffImports(kmt),4_diffClosing stocks(kmt),4_diffImports(kmt),5_diffClosing stocks(kmt),5_diffImports(kmt),6_diffClosing stocks(kmt),6_diffImports(kmt),7_diffClosing stocks(kmt),7_diffImports(kmt),8_diffClosing stocks(kmt),8_diffImports(kmt),9_diffClosing stocks(kmt),9_diffImports(kmt),10_diffClosing stocks(kmt),10_diffImports(kmt),11_diffClosing stocks(kmt),11_diffImports(kmt),12_diffClosing stocks(kmt),12_diffImports(kmt)
count,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0,7619.0
mean,0.772718,1.779474,9.82269,9.623391,-1.958173,-4.353625,0.292847,2.568085,-4.349542,0.797197,-1.525746,6.014376,-7.181767,1.021473,2.610224,-1.123648,2.327598,5.168633,4.804094,-2.542212,-6.346348,3.113764,-0.38926,0.196693
std,664.537825,703.46951,617.242813,701.066714,843.058229,682.490325,772.273894,638.874746,844.203588,647.813162,840.872247,704.50795,756.48508,706.421374,756.588724,689.399314,836.476605,701.275162,622.107759,700.082149,839.213401,689.45742,838.163513,637.106092
min,-39400.0,-22159.0,-29600.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-12490.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0,-31600.0,-22159.0,-39400.0,-22159.0,-39400.0,-22159.0
25%,-32.0,-46.0,-33.0,-45.0,-34.0,-48.0411,-34.0,-46.0,-34.6065,-46.7925,-32.9822,-46.0,-35.61575,-50.0,-33.39925,-46.285,-33.0,-47.62825,-35.0,-49.0,-35.0,-50.0,-34.97915,-50.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,36.0,47.0,37.0,50.0,35.5596,47.0,36.745,49.0,38.0,48.0,40.0,49.0,38.0,48.0,40.0,52.0,40.0,50.0,39.0,48.0524,38.0,50.0,42.0,50.77575
max,7525.7143,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,12277.0,31600.0,22021.0,31600.0,22021.0,7525.7143,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,22021.0,31600.0,12277.0
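The imputation above computes medians on the full training frame, and the test file is later filled with its own medians. A common alternative is to learn the medians once and reuse them everywhere. The following is a minimal sketch of that idea with scikit-learn's `SimpleImputer`; the toy frames `train` and `test` and their values are made up for illustration and stand in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the train and test features (values are made up).
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 100.0], "b": [0.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [1.0, np.nan]})

# Learn per-column medians from the training data only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training medians for the test data.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)  # the medians learned from `train`
```

Because the imputer is fitted only on training rows, the same object can also be placed inside a pipeline so that cross-validation folds do not see each other's medians.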


In [17]:
train_data.Target.value_counts()/len(train_data)*100

0    65.428534
1    34.571466
Name: Target, dtype: float64
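With about 34.6% positives, it helps to know what F1 a trivial model achieves before reading the model scores. The sketch below builds synthetic labels with the same class ratio (the real `Target` column is not used here) and scores a classifier that always predicts class 1. For such a classifier recall is 1 and precision equals the positive rate p, so F1 = 2p / (1 + p).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Synthetic labels with the class ratio reported above (about 34.57% positives).
rng = np.random.default_rng(0)
n = 10_000
y = np.zeros(n, dtype=int)
y[: int(round(0.34571466 * n))] = 1
rng.shuffle(y)

# A classifier that ignores the features and always predicts class 1.
features = np.zeros((n, 1))
baseline = DummyClassifier(strategy="constant", constant=1).fit(features, y)
pred = baseline.predict(features)

p = y.mean()
baseline_f1 = f1_score(y, pred)
print(baseline_f1, 2 * p / (1 + p))
```

Under this assumption the always-positive baseline has an F1 of roughly 0.51, which is a useful reference point when judging the F1 scores of the models that follow.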

In [24]:
X= train_data.drop(['Target','ID'],axis=1)
y=train_data['Target']

In [25]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=50,stratify=y)
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(6095, 122) (6095,) (1524, 122) (1524,)


## Model 1 - Logistic Regression

In [43]:
LogReg = LogisticRegression()
LogReg.fit(X_train,y_train)
P_Train = LogReg.predict(X_train)
P_Test = LogReg.predict(X_test)
print(f1_score(y_train,P_Train))
print(f1_score(y_test,P_Test))
Log_CVal = cross_val_score(estimator=LogReg,cv=10,X=X_train,y=y_train,scoring='f1')
print(Log_CVal)
print(np.mean(Log_CVal))

0.5567644276253548
0.542502387774594
[0.53395785 0.51415094 0.54587156 0.55079007 0.55474453 0.52195122
 0.52380952 0.5412844  0.53753027 0.54368932]
0.5367779675453663


## Model 2 - Decision Tree Classifier

In [53]:
DT = DecisionTreeClassifier()
DT.fit(X_train,y_train)
D_Train = DT.predict(X_train)
D_Test = DT.predict(X_test)
print(f1_score(y_train,D_Train))
print(f1_score(y_test,D_Test))
DT_CVal = cross_val_score(estimator=DT,cv=10,X=X_train,y=y_train,scoring='f1')
print(DT_CVal)
print(np.mean(DT_CVal))



1.0
0.4858757062146893
[0.52195122 0.48258706 0.53460621 0.50717703 0.49074074 0.48979592
 0.49514563 0.52516411 0.53623188 0.51954023]
0.5102940040836866


## Model 2.1 - Decision Tree Classifier using GSCV

In [64]:
DT1 = DecisionTreeClassifier()
grid = GridSearchCV(estimator=DT1, cv=10, scoring='f1', param_grid=dict(max_depth=list(range(1, 15))))
grid.fit(X_train,y_train)


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14]},
             scoring='f1')
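The grid search for the decision tree is fitted above, but its result is not displayed. As a sketch of how a fitted `GridSearchCV` can be inspected, the example below uses synthetic data and a smaller grid; the names `demo_grid` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=15, random_state=0)

demo_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid={"max_depth": list(range(1, 8))},
                         scoring="f1", cv=5)
demo_grid.fit(X_demo, y_demo)

# Best setting and its mean cross-validated F1 score.
print(demo_grid.best_params_, demo_grid.best_score_)

# Full table of results, one row per candidate depth.
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]]
print(results.sort_values("mean_test_score", ascending=False).head())
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on the `grid` object fitted in the cell above.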

## Model 3 - Random Forest Classifier

In [66]:
RF = RandomForestClassifier()
RF.fit(X_train,y_train)
R_Train = RF.predict(X_train)
R_Test = RF.predict(X_test)
print(f1_score(y_train,R_Train))
print(f1_score(y_test,R_Test))
RF_CVal = cross_val_score(estimator=RF,cv=10,X=X_train,y=y_train,scoring='f1')
print(RF_CVal)
print(np.mean(RF_CVal))

0.9997626394493234
0.5958378970427163
[0.56910569 0.55270655 0.58760108 0.58221024 0.5480226  0.55096419
 0.51014493 0.61917808 0.62433862 0.6084507 ]
0.575272268900805


## Model 3.1 - Random Forest Classifier using GSCV

In [68]:
RF1 = RandomForestClassifier()
a=np.arange(1,200,20)
GSV2 = GridSearchCV(estimator=RF1, scoring='f1',cv=10,param_grid = dict(n_estimators=a))
GSV2.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=RandomForestClassifier(),
             param_grid={'n_estimators': array([  1,  21,  41,  61,  81, 101, 121, 141, 161, 181])},
             scoring='f1')

In [70]:
GSV2.best_score_

0.5838555563819121

In [115]:
RF = RandomForestClassifier(n_estimators=141,class_weight='balanced')
RF.fit(X_train,y_train)
R_Train = RF.predict(X_train)
R_Test = RF.predict(X_test)
print(f1_score(y_train,R_Train))
print(f1_score(y_test,R_Test))
RF_CVal = cross_val_score(estimator=RF,cv=10,X=X_train,y=y_train,scoring='f1')
print(RF_CVal)
print(np.mean(RF_CVal))
print(accuracy_score(y_train,R_Train))
print(accuracy_score(y_test,R_Test))


1.0
0.5585168018539975
[0.53824363 0.5060241  0.5698324  0.57541899 0.5        0.52421652
 0.51785714 0.61408451 0.62258953 0.61095101]
0.5579217833537804
1.0
0.75
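A training F1 of 1.0 next to a test F1 of about 0.56 indicates that the forest memorizes the training set. One way to make this gap visible during model selection is to request training scores from cross-validation and compare them across regularization settings. The sketch below does this on synthetic, deliberately noisy data; the `min_samples_leaf` values are arbitrary illustrative choices, not tuned settings for the oil data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Noisy synthetic data (flip_y adds label noise) with a 65/35 class ratio.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=4,
                                     flip_y=0.2, weights=[0.65, 0.35], random_state=0)

gaps = {}
for leaf in (1, 20):
    forest = RandomForestClassifier(n_estimators=50, min_samples_leaf=leaf,
                                    class_weight="balanced", random_state=0)
    cv = cross_validate(forest, X_demo, y_demo, cv=5, scoring="f1",
                        return_train_score=True)
    # Store (mean train F1, mean validation F1) for this setting.
    gaps[leaf] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(leaf, gaps[leaf])
```

A smaller difference between the two numbers suggests less memorization; whether the validation score itself improves has to be checked on the real data.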


## Model 4 - AdaBoost Classifier

In [114]:
AD = AdaBoostClassifier(n_estimators=1)
AD.fit(X_train,y_train)
AD_Train = AD.predict(X_train)
AD_Test = AD.predict(X_test)
print(f1_score(y_train,AD_Train))
print(f1_score(y_test,AD_Test))
AD_CVal = cross_val_score(estimator=AD,cv=10,X=X_train,y=y_train,scoring='f1')
print(AD_CVal)
print(np.mean(AD_CVal))

0.5764400173235167
0.5666372462488967
[0.56896552 0.56762749 0.58872651 0.58577406 0.5619469  0.54424779
 0.57709251 0.59784946 0.57634409 0.59388646]
0.576246079639338


## Model 4.1 - AdaBoost Classifier using GSCV

In [73]:
AD1 = AdaBoostClassifier()
a=np.arange(1,200,20)
GSV3 = GridSearchCV(estimator=AD1, scoring='f1',cv=10,param_grid = dict(n_estimators=a))
GSV3.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=AdaBoostClassifier(),
             param_grid={'n_estimators': array([  1,  21,  41,  61,  81, 101, 121, 141, 161, 181])},
             scoring='f1')

In [113]:
GSV3.best_params_
#GSV3.best_score_

{'n_estimators': 1}
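The grid search selects `n_estimators = 1`, which means no boosting takes place. scikit-learn's default base learner for `AdaBoostClassifier` is a depth-1 decision tree, so the selected model reduces to a single split on one feature. The sketch below confirms this structure on synthetic data; `single` and `stump` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# One boosting round with the default base learner.
single = AdaBoostClassifier(n_estimators=1, random_state=0).fit(X_demo, y_demo)
stump = single.estimators_[0]

# Number of fitted trees, depth of the tree, and number of leaves.
print(len(single.estimators_), stump.get_depth(), stump.get_n_leaves())
```

Since a single stump can only separate the data on one threshold, it may be worth widening the grid (for example over `learning_rate`) before concluding that boosting does not help here.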

## Model 5 - Gradient Boosting Classifier

In [80]:
GB = GradientBoostingClassifier()
GB.fit(X_train,y_train)
GB_Train = GB.predict(X_train)
GB_Test = GB.predict(X_test)
print(f1_score(y_train,GB_Train))
print(f1_score(y_test,GB_Test))
GB_CVal = cross_val_score(estimator=GB,cv=10,X=X_train,y=y_train,scoring='f1')
print(GB_CVal)
print(np.mean(GB_CVal))

0.6415094339622641
0.5743473325766176
[0.53295129 0.54022989 0.58918919 0.56353591 0.54913295 0.52542373
 0.53254438 0.62222222 0.61081081 0.62464183]
0.5690682197579735


## Model 5.1 - Gradient Boosting Classifier using GSCV

In [85]:
GB1 = GradientBoostingClassifier()
a=np.arange(1,200,20)
GSV4 = GridSearchCV(estimator=GB1, scoring='f1',cv=10,param_grid = dict(n_estimators=a))
GSV4.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=GradientBoostingClassifier(),
             param_grid={'n_estimators': array([  1,  21,  41,  61,  81, 101, 121, 141, 161, 181])},
             scoring='f1')

In [83]:
GSV4.best_params_

{'n_estimators': 81}

In [84]:
GB = GradientBoostingClassifier(n_estimators=81)
GB.fit(X_train,y_train)
GB_Train = GB.predict(X_train)
GB_Test = GB.predict(X_test)
print(f1_score(y_train,GB_Train))
print(f1_score(y_test,GB_Test))
GB_CVal = cross_val_score(estimator=GB,cv=10,X=X_train,y=y_train,scoring='f1')
print(GB_CVal)
print(np.mean(GB_CVal))

0.6257110352673493
0.5769669327251996
[0.53602305 0.53061224 0.6        0.57142857 0.54227405 0.52542373
 0.52537313 0.63128492 0.6010929  0.61849711]
0.5682009708904195
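All models above are compared by F1 alone, although `confusion_matrix`, `precision_score`, and `recall_score` are imported. Breaking F1 into its parts shows whether a model loses points through false positives or false negatives. The sketch below uses made-up labels and predictions purely to illustrate the calls.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions, only to illustrate the calls.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

On the real data, the same calls can be applied to `y_test` and any of the test predictions, such as `GB_Test`.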


In [90]:
Selector = SelectFromModel(RandomForestClassifier(n_estimators=141,class_weight='balanced'))
Selector.fit(X_train,y_train)
sel_feat = X_train.columns[Selector.get_support()].to_list()
print(sel_feat)
print(np.round(Selector.threshold_,decimals=2))

['country', '1_diffClosing stocks(kmt)', '1_diffExports(kmt)', '1_diffImports(kmt)', '1_diffRefinery intake(kmt)', '1_diffSumExports(kmt)', '1_diffSumImports(kmt)', '1_diffSumProduction(kmt)', '1_diffSumRefinery intake(kmt)', '2_diffClosing stocks(kmt)', '2_diffExports(kmt)', '2_diffImports(kmt)', '2_diffRefinery intake(kmt)', '2_diffSumProduction(kmt)', '3_diffClosing stocks(kmt)', '3_diffExports(kmt)', '3_diffImports(kmt)', '3_diffRefinery intake(kmt)', '4_diffClosing stocks(kmt)', '4_diffExports(kmt)', '4_diffImports(kmt)', '4_diffRefinery intake(kmt)', '5_diffClosing stocks(kmt)', '5_diffExports(kmt)', '5_diffImports(kmt)', '5_diffRefinery intake(kmt)', '6_diffClosing stocks(kmt)', '6_diffExports(kmt)', '6_diffImports(kmt)', '6_diffRefinery intake(kmt)', '7_diffClosing stocks(kmt)', '7_diffExports(kmt)', '7_diffImports(kmt)', '7_diffRefinery intake(kmt)', '7_diffSumRefinery intake(kmt)', '8_diffClosing stocks(kmt)', '8_diffExports(kmt)', '8_diffImports(kmt)', '8_diffRefinery intake
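When no `threshold` is passed, `SelectFromModel` with a tree ensemble keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data by rebuilding the selection mask by hand; `demo_selector` and `manual_mask` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=3, random_state=0)

demo_selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
demo_selector.fit(X_demo, y_demo)

# Rebuild the mask manually from the fitted forest's importances.
importances = demo_selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()

print(demo_selector.threshold_, demo_selector.get_support().sum())
```

A stricter or looser cut can be requested with the `threshold` argument, for example `threshold="median"` or a scaled form such as `"1.25*mean"`.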

In [98]:
# Restrict the data to the selected features (left disabled):
# X_train1 = X_train[sel_feat]
# X_test1 = X_test[sel_feat]

In [101]:
# Random forest on the selected features only (left disabled):
# RF = RandomForestClassifier(n_estimators=141,class_weight='balanced')
# RF.fit(X_train1,y_train)
# R_Train = RF.predict(X_train1)
# R_Test = RF.predict(X_test1)
# print(f1_score(y_train,R_Train))
# print(f1_score(y_test,R_Test))
# RF_CVal = cross_val_score(estimator=RF,cv=10,X=X_train1,y=y_train,scoring='f1')
# print(RF_CVal)
# print(np.mean(RF_CVal))

In [107]:
from sklearn.decomposition import PCA
import plotly.graph_objs as go
# Perform PCA on X_train, keeping enough components to explain 90% of the variance
# (note: the features are not standardized at this point)
pca = PCA(n_components=0.90, random_state=0).fit(X_train)

# Calculate the cumulative explained variance, in percent
var = np.cumsum(pca.explained_variance_ratio_) * 100

# Initiate an empty figure
fig = go.Figure()

# Add a line trace to the figure, one point per retained component
fig.add_trace(trace=go.Scatter(x=list(range(1, len(var) + 1)),
                               y=var,
                               name='Cumulative Explained Variance',
                               mode='lines+markers'))

# Update the layout with some cosmetics
fig.update_layout(height=500, 
                  width=1000, 
                  title_text='PCA Analysis', 
                  title_x=0.5,
                  xaxis_title='Number of components', 
                  yaxis_title='Explained Variance %')

# Display the figure
fig.show()

In [110]:
pca = PCA(n_components=12,random_state=42)
pca_train=pca.fit_transform(X_train)
pca_test=pca.transform(X_test)
print(pca_train.shape,pca_test.shape)

(6095, 12) (1524, 12)
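PCA is sensitive to feature scale, and the columns here range from single-digit price differences to values in the tens of thousands, so the unscaled components are likely dominated by the largest columns. A minimal sketch of standardizing before PCA is shown below on synthetic data; `scaled_pca` is an illustrative name, and the 0.90 variance target simply mirrors the value used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_classification(n_samples=400, n_features=30,
                                n_informative=10, random_state=0)
X_demo[:, 0] *= 10_000  # mimic one feature on a much larger scale

# Standardize first, then keep enough components for 90% of the variance.
scaled_pca = make_pipeline(StandardScaler(),
                           PCA(n_components=0.90, svd_solver="full"))
X_reduced = scaled_pca.fit_transform(X_demo)

pca_step = scaled_pca.named_steps["pca"]
cumulative = np.cumsum(pca_step.explained_variance_ratio_)
print(X_reduced.shape, cumulative[-1])
```

On the real data the pipeline would be fitted on `X_train` and then applied to `X_test` with `transform`, in the same way as the unscaled PCA above.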


In [112]:
# Random forest on the PCA components (left disabled):
# RF = RandomForestClassifier(n_estimators=141,class_weight='balanced')
# RF.fit(pca_train,y_train)
# R_Train = RF.predict(pca_train)
# R_Test = RF.predict(pca_test)
# print(f1_score(y_train,R_Train))
# print(f1_score(y_test,R_Test))
# RF_CVal = cross_val_score(estimator=RF,cv=10,X=pca_train,y=y_train,scoring='f1')
# print(RF_CVal)
# print(np.mean(RF_CVal))

In [116]:
# Predicting test values using the AdaBoost classifier

test_data1 = test_data.copy()

In [121]:
X1=test_data.drop("ID",axis=1)
X1.shape

(2540, 122)

In [125]:
X1.isnull().sum().any()

True

In [126]:
Missingcolumns1 = X1[X1.columns[X1.isnull().sum()>0]].columns

In [127]:
for cols in Missingcolumns1:
    # Assign back instead of using inplace=True on a column selection
    X1[cols] = X1[cols].fillna(X1[cols].median())

In [128]:
X1.isnull().sum().any()

False

In [129]:
Pred_Output = AD.predict(X1)

In [132]:
Pred_Output.shape

(2540,)

In [137]:
Final = pd.DataFrame(test_data['ID'])

In [139]:
Final['OP'] = Pred_Output

In [142]:
Final.to_csv('submission.csv', index=False,header=False)