# Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv):
1. Import all the required Python Libraries.
2. Locate open source data from the web (e.g., https://www.kaggle.com). Provide a clear 
description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the 
describe() function to get some initial statistics. Provide variable descriptions and the types of the variables. 
Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the 
data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. 
If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.


# Importing Important Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Pandas DataFrame Function for Loading the Dataset

In [2]:
df=pd.read_csv("APY.csv") # Indian Crop Production Dataset
df

,State,District,Crop,Crop_Year,Season,Area,Production,Yield
0,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,Kharif,2439.6,3415.0,1.40
1,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,Rabi,1626.4,2277.0,1.40
2,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,Autumn,4147.0,3060.0,0.74
3,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,Summer,4147.0,2660.0,0.64
4,Andaman and Nicobar Island,NICOBARS,Arecanut,2009,Autumn,4153.0,3120.0,0.75
...,...,...,...,...,...,...,...,...
345331,West Bengal,PURULIA,Wheat,2015,Rabi,855.0,1241.0,1.45
345332,West Bengal,PURULIA,Wheat,2016,Rabi,1366.0,2415.0,1.77
345333,West Bengal,PURULIA,Wheat,2017,Rabi,1052.0,2145.0,2.04
345334,West Bengal,PURULIA,Wheat,2018,Rabi,833.0,2114.0,2.54


In [32]:
df.shape  # Representing Dimensionality of dataset (Rows,Columns)

(340383, 8)

In [10]:
df.head()  # Return the First 5 rows

,State,District,Crop,Crop_Year,Season,Area,Production,Yield
0,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,Kharif,2439.6,3415.0,1.4
1,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,Rabi,1626.4,2277.0,1.4
2,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,Autumn,4147.0,3060.0,0.74
3,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,Summer,4147.0,2660.0,0.64
4,Andaman and Nicobar Island,NICOBARS,Arecanut,2009,Autumn,4153.0,3120.0,0.75


In [11]:
df.tail() # Return the Last 5 rows

,State,District,Crop,Crop_Year,Season,Area,Production,Yield
345331,West Bengal,PURULIA,Wheat,2015,Rabi,855.0,1241.0,1.45
345332,West Bengal,PURULIA,Wheat,2016,Rabi,1366.0,2415.0,1.77
345333,West Bengal,PURULIA,Wheat,2017,Rabi,1052.0,2145.0,2.04
345334,West Bengal,PURULIA,Wheat,2018,Rabi,833.0,2114.0,2.54
345335,West Bengal,PURULIA,Wheat,2019,Rabi,516.0,931.0,1.8


In [12]:
df.sample(5)  # Returns 5 randomly selected rows

,State,District,Crop,Crop_Year,Season,Area,Production,Yield
340631,West Bengal,MALDAH,Rapeseed &Mustard,2014,Rabi,35096.0,41871.0,1.19
78162,Chhattisgarh,SURAJPUR,Tobacco,2018,Whole Year,12.0,7.0,0.58
86304,Gujarat,KHEDA,Jowar,2014,Rabi,500.0,712.0,1.42
144671,Kerala,KOTTAYAM,Arecanut,2013,Whole Year,1581.0,1103.0,0.7
323662,Uttar Pradesh,MIRZAPUR,Urad,2014,Summer,6.0,3.0,0.5


In [43]:
df.columns   # Column descriptions:
# State --> Name of the state
# District --> Name of the district
# Crop --> Variety of crop
# Crop_Year --> Year
# Season --> Season
# Area --> Area in hectares
# Production --> Production in tonnes
# Yield --> Yield (tonnes/hectare)

Index(['State', 'District ', 'Crop', 'Crop_Year', 'Season', 'Area ',
       'Production', 'Yield'],
      dtype='object')
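
The column index above shows trailing spaces in two labels ('District ' and 'Area '), which makes `df['Area']` fail with a KeyError unless the space is typed too. A minimal sketch of one way to clean the labels, using a small made-up frame (`toy` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

# Toy frame with the same kind of trailing-space labels seen in df.columns
toy = pd.DataFrame(
    {"State": ["Kerala"], "District ": ["KOTTAYAM"], "Area ": [1581.0]}
)

# Strip leading/trailing whitespace from every column label
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

After this step the columns can be selected by their clean names, e.g. `toy["Area"]`.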

In [13]:
df.info() # This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 345336 entries, 0 to 345335
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   State       345336 non-null  object 
 1   District    345336 non-null  object 
 2   Crop        345327 non-null  object 
 3   Crop_Year   345336 non-null  int64  
 4   Season      345336 non-null  object 
 5   Area        345336 non-null  float64
 6   Production  340388 non-null  float64
 7   Yield       345336 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 21.1+ MB


In [15]:
df.dtypes  # Returns the data types in dataset

State          object
District       object
Crop           object
Crop_Year       int64
Season         object
Area          float64
Production    float64
Yield         float64
dtype: object

In [19]:
df['Production']=df['Production'].astype("int")  # astype() changes the data type of a column; it raises an error if NaN values are present, so the missing rows must be dropped first

In [20]:
df.dtypes

State          object
District       object
Crop           object
Crop_Year       int64
Season         object
Area          float64
Production      int32
Yield         float64
dtype: object
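
The conversion above only works once the missing Production values are gone, and the resulting `int32` is likely platform-dependent. The sketch below, on a made-up Series, shows the failure on NaN and two possible alternatives: the nullable `Int64` dtype, or dropping NaN and requesting `int64` explicitly. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values; the NaN stands in for a missing Production entry
production = pd.Series([3415.0, np.nan, 3060.0], name="Production")

conversion_error = None
try:
    production.astype("int")  # plain integer dtypes cannot hold NaN
except ValueError as exc:
    conversion_error = exc

# Option 1: nullable integer dtype keeps the missing entry as <NA>
as_nullable = production.astype("Int64")

# Option 2: drop missing entries, then cast to a fixed-width integer
as_plain = production.dropna().astype("int64")

print(type(conversion_error).__name__)
print(as_nullable.dtype, as_plain.dtype)
```

Which option fits depends on whether rows with missing Production should be kept for later analysis.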

In [9]:
df.iloc[:,0:3]  # Integer-location based indexing: selection by position [rows, columns]

,State,District,Crop
0,Andaman and Nicobar Island,NICOBARS,Arecanut
1,Andaman and Nicobar Island,NICOBARS,Arecanut
2,Andaman and Nicobar Island,NICOBARS,Arecanut
3,Andaman and Nicobar Island,NICOBARS,Arecanut
4,Andaman and Nicobar Island,NICOBARS,Arecanut
...,...,...,...
345331,West Bengal,PURULIA,Wheat
345332,West Bengal,PURULIA,Wheat
345333,West Bengal,PURULIA,Wheat
345334,West Bengal,PURULIA,Wheat


In [16]:
df.loc[:,["Crop","Production"]].head(10)  # Selection By label (Rows,Columns)

,Crop,Production
0,Arecanut,3415.0
1,Arecanut,2277.0
2,Arecanut,3060.0
3,Arecanut,2660.0
4,Arecanut,3120.0
5,Arecanut,2080.0
6,Arecanut,2000.0
7,Arecanut,2061.0
8,Arecanut,2083.0
9,Arecanut,1525.0
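
The two selections above differ in how they read a slice: `iloc` counts positions and excludes the end, while `loc` matches index labels and includes the end. A small sketch on a made-up frame with a non-default index (all names and values are illustrative):

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions
toy = pd.DataFrame(
    {"Crop": ["Arecanut", "Wheat", "Rice"], "Production": [3415.0, 1241.0, 500.0]},
    index=[10, 20, 30],
)

# Positions 0 and 1 (end position 2 is excluded), first column only
by_position = toy.iloc[0:2, 0:1]

# Labels 10 through 20 (end label 20 is included), column selected by name
by_label = toy.loc[10:20, ["Crop"]]

print(by_position.equals(by_label))
```

With the default RangeIndex used in this notebook, labels and positions coincide, so the difference only shows up in whether the end of the slice is included.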


In [39]:
df['Season'].value_counts()  # value_counts() returns a Series with the count of each unique value, sorted in descending order

Kharif         138369
Rabi           100951
Whole Year      68680
Summer          22098
Winter           8249
Autumn           6989
Name: Season, dtype: int64
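
Raw counts can be hard to compare across columns. As a hedged sketch, `value_counts(normalize=True)` returns shares instead of counts, and `dropna=False` also counts missing entries; the Series below is made up for illustration.

```python
import pandas as pd

# Made-up seasons, including one missing entry
season = pd.Series(["Kharif", "Kharif", "Rabi", None], name="Season")

# Relative frequencies of the non-missing values
shares = season.value_counts(normalize=True)

# Counts including the missing entry as its own row
counts_with_missing = season.value_counts(dropna=False)

print(shares)
print(counts_with_missing)
```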

In [40]:
df['Crop'].value_counts()  # Count of rows per crop

Rice                     21611
Maize                    20513
Moong(Green Gram)        15139
Urad                     14581
Sesamum                  13049
Groundnut                12586
Wheat                    11220
Rapeseed &Mustard        11034
Sugarcane                10942
Arhar/Tur                10885
Potato                   10756
Onion                    10675
Gram                     10474
Jowar                     9769
Dry chillies              8971
Bajra                     8165
Peas & beans (Pulses)     7266
Sunflower                 7244
Small millets             6985
Cotton(lint)              6475
Masoor                    6383
Linseed                   5892
Barley                    5891
Ragi                      5757
Sweet potato              5742
Other Kharif pulses       5720
Turmeric                  5607
Horse-gram                5424
Garlic                    5279
Coriander                 5037
Soyabean                  4988
Other  Rabi pulses        4866
Castor s

In [38]:
df['State'].value_counts()  # Count of rows per state

Uttar Pradesh                 44781
Madhya Pradesh                29906
Karnataka                     27493
Bihar                         24697
Rajasthan                     20363
Tamil Nadu                    18507
Assam                         18186
Maharashtra                   17922
Andhra Pradesh                16363
Odisha                        16153
Chhattisgarh                  15285
Gujarat                       14053
West Bengal                   12596
Haryana                        8305
Uttarakhand                    6702
Nagaland                       5676
Himachal Pradesh               5043
Jharkhand                      5004
Kerala                         4870
Telangana                      4684
Arunachal Pradesh              4345
Meghalaya                      4322
Jammu and Kashmir              4175
Punjab                         4142
Manipur                        3093
Tripura                        2557
Mizoram                        2102
Puducherry                  

In [21]:
df.index   #The index (row labels) of the DataFrame.

RangeIndex(start=0, stop=345336, step=1)

In [14]:
df.describe()  # describe() returns summary statistics (count, mean, std, min, quartiles, max) for the numeric columns

,Crop_Year,Area,Production,Yield
count,345336.0,345336.0,340388.0,345336.0
mean,2008.887512,11671.47,958472.6,79.423135
std,6.564361,45840.79,21530680.0,916.678396
min,1997.0,0.004,0.0,0.0
25%,2003.0,74.0,87.0,0.55
50%,2009.0,532.0,717.0,1.0
75%,2015.0,4112.0,7182.0,2.47
max,2020.0,8580100.0,1597800000.0,43958.33
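
Step 5 of the task list mentions normalization, and the statistics above show Area and Production on very different scales. One common choice is min-max scaling, which maps each column to the range [0, 1]. The following is a minimal sketch on made-up values, not a step the notebook itself performs:

```python
import pandas as pd

# Made-up numeric values standing in for two columns of the dataset
toy = pd.DataFrame(
    {"Area": [74.0, 532.0, 4112.0], "Production": [87.0, 717.0, 7182.0]}
)

# Min-max scaling per column: (x - min) / (max - min)
# Note: a constant column would give max == min and a division by zero.
normalized = (toy - toy.min()) / (toy.max() - toy.min())

print(normalized)
```

Because the real columns contain extreme maxima, a min-max scale would squeeze most values close to 0; whether that is acceptable depends on the later analysis.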


In [18]:
df.isnull().sum()  # Checking of Missing Values in Dataset

State            0
District         0
Crop             9
Crop_Year        0
Season           0
Area             0
Production    4948
Yield            0
dtype: int64

In [4]:
df.dropna(subset=["Production"],axis=0,inplace=True)  # Drop the rows where Production is missing (modifies df in place)

In [6]:
df.dropna(subset=["Crop"],axis=0,inplace=True)    # Drop the rows where Crop is missing (modifies df in place)

In [7]:
df.isnull().sum()

State         0
District      0
Crop          0
Crop_Year     0
Season        0
Area          0
Production    0
Yield         0
dtype: int64
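
The two dropna() cells can also be written as a single call without `inplace=True`. The sketch below uses a made-up frame and additionally resets the index, since dropping rows leaves gaps in the original row labels; `toy` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Crop and one missing Production
toy = pd.DataFrame(
    {
        "Crop": ["Rice", None, "Wheat", "Maize"],
        "Production": [10.0, 5.0, np.nan, 7.0],
    }
)

# Drop rows missing either column, then renumber the rows from 0
cleaned = toy.dropna(subset=["Crop", "Production"]).reset_index(drop=True)

print(cleaned)
print(cleaned.isnull().sum())
```

Returning a new frame instead of modifying `df` in place keeps the original data available for comparison.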

In [24]:
df.isnull().any()  # Returns True for each column that contains at least one missing value

State         False
District      False
Crop           True
Crop_Year     False
Season        False
Area          False
Production     True
Yield         False
dtype: bool

In [25]:
df.duplicated().sum()  # Returns the number of rows that are exact duplicates of an earlier row

0
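
The dataset has no duplicated rows, so nothing needs removing here. For reference, a small sketch on a made-up frame of how duplicates would be counted and dropped:

```python
import pandas as pd

# Made-up frame in which the second row repeats the first
toy = pd.DataFrame(
    {"Crop": ["Rice", "Rice", "Wheat"], "Crop_Year": [2007, 2007, 2007]}
)

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```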

In [29]:
df.nunique()   # Returns the number of distinct values in each column

State            37
District        707
Crop             55
Crop_Year        24
Season            6
Area          47655
Production    60410
Yield         10595
dtype: int64

In [30]:
pd.get_dummies(df,columns=['Season'])  # One-hot encoding (not label encoding)
# The get_dummies function converts a categorical variable into dummy (indicator) variables, one column per category. 

,State,District,Crop,Crop_Year,Area,Production,Yield,Season_Autumn,Season_Kharif,Season_Rabi,Season_Summer,Season_Whole Year,Season_Winter
0,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,2439.6,3415,1.40,0,1,0,0,0,0
1,Andaman and Nicobar Island,NICOBARS,Arecanut,2007,1626.4,2277,1.40,0,0,1,0,0,0
2,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,4147.0,3060,0.74,1,0,0,0,0,0
3,Andaman and Nicobar Island,NICOBARS,Arecanut,2008,4147.0,2660,0.64,0,0,0,1,0,0
4,Andaman and Nicobar Island,NICOBARS,Arecanut,2009,4153.0,3120,0.75,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
345331,West Bengal,PURULIA,Wheat,2015,855.0,1241,1.45,0,0,1,0,0,0
345332,West Bengal,PURULIA,Wheat,2016,1366.0,2415,1.77,0,0,1,0,0,0
345333,West Bengal,PURULIA,Wheat,2017,1052.0,2145,2.04,0,0,1,0,0,0
345334,West Bengal,PURULIA,Wheat,2018,833.0,2114,2.54,0,0,1,0,0,0
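
get_dummies produces one indicator column per season, which is one-hot encoding. Label encoding is a different technique that maps each category to a single integer code. A minimal sketch of both on a made-up Series follows; the `dtype=int` argument is an assumption made here to get 0/1 output like that shown above, since some pandas versions return booleans by default.

```python
import pandas as pd

# Made-up seasons for illustration
season = pd.Series(["Kharif", "Rabi", "Autumn", "Kharif"], name="Season")

# Label encoding: one integer per category, assigned in sorted category order
label_codes = season.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(season, prefix="Season", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Label codes imply an ordering (Autumn < Kharif < Rabi) that has no real meaning for seasons, which is one reason one-hot columns are often preferred for nominal variables.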
