# Python Foundation: Topics
> <p>
    1. Import the data from files as DataFrame <br/>
    2. Basic data analysis <br/>
    3. Data Manipulation / Data Cleaning <br/>
        - Structural <br/>
        - Content based <br/>
    4. Exploratory Analysis
 </p>

Objective of the analysis

Availability of the data
    - CREATE RANDOM DATA
    - DATA IMPORT
    
Structural understanding

Data understanding

Data Prep:
    Structural 
        1. Add new columns
        2. Delete some columns
        3. Rearrange the columns
        4. Subsetting (Extracting some columns)
        5. Rename the columns
        6. Change the datatypes
        
    Content based
        1. Filters
        2. Sort
        3. Removal of duplicates
        4. Data imputation
            1. Missing value treatment
            2. Outlier treatment
        5. Binning or Grouping of the data
        6. Encoding
        7. Grouping of the data/ Summaries
        8. Joins/Merge
        9. Appending
        
Exploratory analysis

## Data Frames

In [2]:
import pandas as pd

In [3]:
# create a dataframe from dict
d1 = {'Emp ID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008], 
      'Emp Name': ['John', 'Mac', 'Andrew', 'Andrina', 'Lee', 'Sam', 'Kim', 'Leena'], 
      'Dept': ['Finance', 'HR', 'IT', 'IT', 'IT', 'IT', 'Finance', 'HR'], 
      'Salary': [10000, 12000, 10000, 12000, 14000, 13000, 10000, 18000]}
pd.DataFrame(d1, index = range(1, 9))

Unnamed: 0,Emp ID,Emp Name,Dept,Salary
1,1001,John,Finance,10000
2,1002,Mac,HR,12000
3,1003,Andrew,IT,10000
4,1004,Andrina,IT,12000
5,1005,Lee,IT,14000
6,1006,Sam,IT,13000
7,1007,Kim,Finance,10000
8,1008,Leena,HR,18000


In [6]:
# create a dataframe from pandas series
s1 = pd.Series([1001, 1002, 1003,1004, 1005,1007,1010,1003], name = 'EmpId', dtype = int, index = range(1, 9))
s2 = pd.Series(['John', 'Mac', 'Raj', 'Tim', 'Lee', 'Joe', 'Kim', 'Joe'], name = 'EmpName', dtype = object, index = range(1, 9))
s3 = pd.Series(['Finance', 'HR', 'IT', 'IT', 'IT', 'IT', 'Finance', 'HR'], name = 'Dept', dtype = object, index = range(1, 9))
s4 = pd.Series([10000, 12000, 10000, 12000, 14000, 13000, 10000, 18000], name = 'Salary', dtype = int, index = range(1, 9))
pd.DataFrame([s1, s2, s3, s4]).T

Unnamed: 0,EmpId,EmpName,Dept,Salary
1,1001,John,Finance,10000
2,1002,Mac,HR,12000
3,1003,Raj,IT,10000
4,1004,Tim,IT,12000
5,1005,Lee,IT,14000
6,1007,Joe,IT,13000
7,1010,Kim,Finance,10000
8,1003,Joe,HR,18000
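
One thing worth checking with the Series approach is the resulting column dtypes. The sketch below uses three short, hypothetical Series (not the full `s1`..`s4`) to compare stacking and transposing against `pd.concat(..., axis=1)`; it is an illustration under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# Small hypothetical Series, mirroring the shape of s1..s4 above
emp_id = pd.Series([1001, 1002, 1003], name="EmpId", index=range(1, 4))
emp_name = pd.Series(["John", "Mac", "Raj"], name="EmpName", index=range(1, 4))
salary = pd.Series([10000, 12000, 10000], name="Salary", index=range(1, 4))

# Stacking the Series as rows mixes numbers and strings in each intermediate
# column, so after transposing every column is stored as object
via_transpose = pd.DataFrame([emp_id, emp_name, salary]).T

# Concatenating along axis=1 keeps each Series as its own column and dtype
via_concat = pd.concat([emp_id, emp_name, salary], axis=1)

print(via_transpose.dtypes)
print(via_concat.dtypes)
```

If numeric columns are needed downstream (for example in `describe()`), the `pd.concat` form avoids a later type conversion.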


## Data Import

In [7]:
# import the pandas package
import pandas as pd

In [8]:
# Import the data from files as DataFrame
stores = pd.read_csv('Data Sets/Stores.csv')

In [9]:
score = pd.read_csv('Data Sets/score.csv')

In [10]:
# import the data; sep=',' is already the default delimiter for read_csv
stores = pd.read_csv('Data Sets/stores.csv', sep = ',')

In [11]:
pwd

'C:\\Users\\RAHUL CHHIKARA\\Desktop\\PYTHON BASIC\\class 4\\Pandas class files'

In [None]:
import os

os.chdir('/Users/RAHUL CHHIKARA/Desktop/PYTHON BASIC/class 4/Data Sets')  #change the directory 

## Basic data analysis
> 1. Check type of data structure
> 2. No of dimensions
> 3. Size of the data
> 4. Shape of the data
> 5. Columns
> 6. Count of columns
> 7. Data types of each column
> 8. No of non missing values/records in each column
> 9. Freq table of data types
> 10. Memory used
> 11. See sample records
> 12. Complete structural summary
> 13. Data descriptives
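
The checklist above maps onto standard pandas attributes and methods. The following is a minimal sketch on a small, hypothetical DataFrame `df` (a made-up stand-in for the stores data, so it runs without the CSV file); the numbers in the comments refer to the checklist items.

```python
import pandas as pd

# Hypothetical miniature stand-in for the stores data
df = pd.DataFrame({
    "StoreCode": ["STR101", "STR102", "STR103"],
    "Location": ["Delhi", "Delhi", "Chennai"],
    "TotalSales": [160.0, 160.0, 108.0],
    "AcqCostPercust": [3.9, None, 3.85],
})

print(type(df))                       # 1. type of data structure
print(df.ndim)                        # 2. number of dimensions
print(df.size)                        # 3. total number of cells (rows * columns)
print(df.shape)                       # 4. (rows, columns)
print(df.columns.tolist())            # 5. column names
print(len(df.columns))                # 6. count of columns
print(df.dtypes)                      # 7. data type of each column
print(df.count())                     # 8. non-missing values per column
print(df.dtypes.value_counts())       # 9. frequency table of data types
print(df.memory_usage(deep=True))     # 10. memory used per column, in bytes
print(df.sample(2, random_state=0))   # 11. sample records
df.info()                             # 12. complete structural summary
print(df.describe())                  # 13. descriptive statistics
```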

In [12]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   StoreCode        32 non-null     object 
 1   StoreName        32 non-null     object 
 2   StoreType        32 non-null     object 
 3   Location         32 non-null     object 
 4   OperatingCost    32 non-null     float64
 5   Staff_Cnt        32 non-null     int64  
 6   TotalSales       32 non-null     float64
 7   Total_Customers  32 non-null     int64  
 8   AcqCostPercust   29 non-null     float64
 9   BasketSize       32 non-null     float64
 10  ProfitPercust    32 non-null     float64
 11  OwnStore         32 non-null     int64  
 12  OnlinePresence   32 non-null     int64  
 13  Tenure           32 non-null     int64  
 14  StoreSegment     32 non-null     int64  
dtypes: float64(5), int64(6), object(4)
memory usage: 3.9+ KB


In [13]:
stores.head()  # first 5 rows by default; pass a number to change how many rows are shown

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,Staff_Cnt,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,60,160.0,110,3.9,2.62,16.46,0,1,4,4
1,STR102,Apparel Zone,Apparel,Delhi,21.0,60,160.0,110,3.9,2.875,17.02,0,1,4,4
2,STR103,Super Bazar,Super Market,Delhi,22.8,40,108.0,93,3.85,2.32,18.61,1,1,4,1
3,STR104,Super Market,Super Market,Delhi,21.4,60,258.0,110,3.08,3.215,19.44,1,0,3,1
4,STR105,Central Store,Super Market,Delhi,18.7,80,360.0,175,3.15,3.44,17.02,0,0,3,2


In [14]:
stores.tail(10)  # last 10 rows; tail() returns the last 5 rows when no number is passed

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,Staff_Cnt,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment
22,STR123,Fashion Bazar,Apparel,Mumbai,15.2,80,304.0,150,3.15,3.435,17.3,0,0,3,2
23,STR124,Digital Bazar,Electronincs,Mumbai,13.3,80,350.0,245,3.73,3.84,15.41,0,0,3,4
24,STR125,Electronics Zone,Electronincs,Kolkata,19.2,80,400.0,175,3.08,3.845,17.05,0,0,3,2
25,STR126,Apparel Zone,Apparel,Kolkata,27.3,40,79.0,66,4.08,1.935,18.9,1,1,4,1
26,STR127,Super Bazar,Super Market,Kolkata,26.0,40,120.3,91,4.43,2.14,16.7,0,1,5,2
27,STR128,Super Market,Super Market,Kolkata,30.4,40,95.1,113,3.77,1.513,16.9,1,1,5,2
28,STR129,Central Store,Super Market,Kolkata,15.8,80,351.0,264,4.22,3.17,14.5,0,1,5,4
29,STR130,Apparel Zone,Apparel,Kolkata,19.7,60,145.0,175,3.62,2.77,15.5,0,1,5,4
30,STR131,Fashion Bazar,Apparel,Kolkata,15.0,80,301.0,335,3.54,3.57,14.6,0,1,5,4
31,STR132,Digital Bazar,Electronincs,Kolkata,21.4,40,121.0,109,4.11,2.78,18.6,1,1,4,2


In [16]:
stores.describe(percentiles=[.1,.2,.3,.99]).T

Unnamed: 0,count,mean,std,min,10%,20%,30%,50%,99%,max
OperatingCost,32.0,20.090625,6.026948,10.4,14.34,15.2,15.98,19.2,33.435,33.9
Staff_Cnt,32.0,61.875,17.859216,40.0,40.0,40.0,40.0,60.0,80.0,80.0
TotalSales,32.0,230.721875,123.938694,71.1,80.61,120.14,142.06,196.3,468.28,472.0
Total_Customers,32.0,146.6875,68.562868,52.0,66.0,93.4,106.2,123.0,312.99,335.0
AcqCostPercust,29.0,3.651034,0.532664,2.76,2.986,3.122,3.218,3.73,4.79,4.93
BasketSize,32.0,3.21725,0.978457,1.513,1.9555,2.349,2.773,3.325,5.39951,5.424
ProfitPercust,32.0,17.84875,1.786943,14.5,15.534,16.734,17.02,17.71,22.0692,22.9
OwnStore,32.0,0.4375,0.504016,0.0,0.0,0.0,0.0,0.0,1.0,1.0
OnlinePresence,32.0,0.40625,0.498991,0.0,0.0,0.0,0.0,0.0,1.0,1.0
Tenure,32.0,3.6875,0.737804,3.0,3.0,3.0,3.0,4.0,5.0,5.0


## Data Manipulation / Data Cleaning - Structural
> 1. Subsetting (Extracting Column/s)
> 2. Rearrange/Reorder the column/s
> 3. Adding new column/s in dataframe
> 4. Delete column/s from dataframe
> 5. Rename the column/s
> 6. Change datatypes

### Subsetting: extracting the specific columns from the data

In [18]:
a = pd.read_excel('Data Sets/Car_data_oth.xlsx')

In [20]:
a.head(5)

Unnamed: 0,Manufacturer,Model,Type,MPGcity,MPGhighway,AirBags,DriveTrain,Cylinders,EngineSize,Horsepower,...,Length,Wheelbase,Width,Turncircle,Rearseatroom,Luggageroom,Weight,Origin,Make,Date
0,Acura,Integra,Small,25,31,,Front,4,1.8,140,...,177,102,68,37,26.5,11.0,2705,non-USA,Acura Integra,2016-12-01
1,Acura,Legend,Midsize,18,25,Driver & Passenger,Front,6,3.2,200,...,195,115,71,38,30.0,15.0,3560,non-USA,Acura Legend,2016-12-01
2,Audi,90,Compact,20,26,Driver only,Front,6,2.8,172,...,180,102,67,37,28.0,14.0,3375,non-USA,Audi 90,2016-12-01
3,Audi,100,Midsize,19,26,Driver & Passenger,Front,6,2.8,172,...,193,106,70,37,31.0,17.0,3405,non-USA,Audi 100,2016-12-01
4,BMW,535i,Midsize,22,30,Driver only,Rear,4,3.5,208,...,186,109,69,39,27.0,13.0,3640,non-USA,BMW 535i,2016-12-01


In [21]:
# method 1
stores.TotalSales.head()

0    160.0
1    160.0
2    108.0
3    258.0
4    360.0
Name: TotalSales, dtype: float64

In [23]:
# method 2
stores[['TotalSales','OperatingCost']].head(2)

Unnamed: 0,TotalSales,OperatingCost
0,160.0,21.0
1,160.0,21.0


In [24]:
# method 3
stores.iloc[:,[1,7,6]].head()

Unnamed: 0,StoreName,Total_Customers,TotalSales
0,Electronics Zone,110,160.0
1,Apparel Zone,110,160.0
2,Super Bazar,93,108.0
3,Super Market,110,258.0
4,Central Store,175,360.0


In [25]:
stores.loc[:, ['TotalSales','OperatingCost']].head(3)

Unnamed: 0,TotalSales,OperatingCost
0,160.0,21.0
1,160.0,21.0
2,108.0,22.8


### Inserting a new column in DF

In [26]:
type(stores.TotalSales)   # selecting a single column from a DataFrame returns a pandas Series

pandas.core.series.Series

In [27]:
stores['profit']=stores.TotalSales - stores.OperatingCost
stores.head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,Staff_Cnt,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,60,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,60,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,40,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,60,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,80,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3


In [29]:
stores = stores.assign(profit1 = stores.TotalSales - stores.OperatingCost, profit2 = stores.TotalSales - stores.OperatingCost)
stores.head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,Staff_Cnt,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit,profit1,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,60,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0,139.0,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,60,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0,139.0,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,40,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2,85.2,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,60,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6,236.6,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,80,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3,341.3,341.3


### Delete a column

In [30]:
stores.drop(['Staff_Cnt'], axis = 1, inplace=True)   # axis=1 drops columns; the default axis=0 drops rows
# inplace=True modifies stores directly; without it, drop() returns a new DataFrame and leaves stores unchanged

In [31]:
stores.head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit,profit1,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0,139.0,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0,139.0,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2,85.2,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6,236.6,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3,341.3,341.3


In [32]:
stores = stores.drop(['profit', 'profit1'], axis = 1)

In [33]:
stores.TotalSales.max()

472.0

In [35]:
# drop the rows where TotalSales > 450; the result is not assigned, so stores itself is unchanged
stores.drop(stores[stores.TotalSales > 450].index)

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3
5,STR106,Apparel Zone,Apparel,Delhi,18.1,225.0,105,2.76,3.46,20.22,1,0,3,1,206.9
6,STR107,Fashion Bazar,Apparel,Delhi,14.3,360.0,245,3.21,3.57,15.84,0,0,3,4,345.7
7,STR108,Digital Bazar,Electronincs,Delhi,24.4,146.7,62,3.69,3.19,20.0,1,0,4,2,122.3
8,STR109,Electronics Zone,Electronincs,Chennai,22.8,140.8,95,3.92,3.15,22.9,1,0,4,2,118.0
9,STR110,Apparel Zone,Apparel,Chennai,19.2,167.6,123,3.92,3.44,18.3,1,0,4,4,148.4


### Rename the columns

In [36]:
# 'Staff_Cnt' was dropped earlier; by default rename() silently ignores keys that are not present
stores.rename(columns={'Total_Customers':'TotalCustomers','Staff_Cnt':'StaffCount'}).head()
# stores.rename?

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,TotalCustomers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3
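
The structural checklist also mentions rearranging columns and changing datatypes, which the cells above do not demonstrate. A minimal sketch of both, using a small hypothetical DataFrame `df` rather than the real stores data; the target dtypes (`bool`, `category`) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the stores data
df = pd.DataFrame({
    "StoreCode": ["STR101", "STR102", "STR103"],
    "TotalSales": [160.0, 160.0, 108.0],
    "OwnStore": [0, 0, 1],
    "Location": ["Delhi", "Delhi", "Chennai"],
})

# Rearrange: select the columns in the order you want them
reordered = df[["StoreCode", "Location", "OwnStore", "TotalSales"]]

# Change datatypes: astype returns a new object, so keep the result
converted = reordered.astype({"OwnStore": "bool", "Location": "category"})

print(reordered.columns.tolist())
print(converted.dtypes)
```

Both operations return new objects, so the original `df` keeps its column order and dtypes unless the result is assigned back.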


## Data Manipulation / Data Cleaning - Content based
> 1. Filters
> 2. Sorting
> 3. Removal of duplicates
> 4. Data imputation
> 5. Encoding
> 6. Binning/creating new grouped variables
> 7. Grouping/Aggregations
> 8. Joins/Merge
> 9. Data append
> 10. Summaries
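
As a quick orientation for the later items in this list (encoding, binning, grouping, joins, append), here is a minimal sketch on a small hypothetical DataFrame. The bin edges, the `regions` lookup table and the extra `STR105` row are all made up for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the stores data
df = pd.DataFrame({
    "StoreCode": ["STR101", "STR102", "STR103", "STR104"],
    "Location": ["Delhi", "Delhi", "Chennai", "Chennai"],
    "TotalSales": [160.0, 258.0, 108.0, 472.0],
})

# Encoding: one indicator column per Location value
encoded = pd.get_dummies(df, columns=["Location"])

# Binning: cut TotalSales into labelled ranges (bin edges chosen arbitrarily)
df["SalesBand"] = pd.cut(df["TotalSales"], bins=[0, 150, 300, 500],
                         labels=["low", "mid", "high"])

# Grouping/aggregation: total and mean sales per Location
summary = df.groupby("Location")["TotalSales"].agg(["sum", "mean"])

# Join/merge: attach a hypothetical region lookup table on Location
regions = pd.DataFrame({"Location": ["Delhi", "Chennai"],
                        "Region": ["North", "South"]})
merged = df.merge(regions, on="Location", how="left")

# Append: stack the rows of two frames that share the same columns
more = pd.DataFrame({"StoreCode": ["STR105"], "Location": ["Mumbai"],
                     "TotalSales": [304.0]})
appended = pd.concat([df[["StoreCode", "Location", "TotalSales"]], more],
                     ignore_index=True)

print(summary)
print(merged)
```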

### Filtering: 
> 1. Get records from stores where Location is Delhi
> 2. Records from Kolkata where TotalSales > 100 and < 300
> 3. All store codes and store types from Chennai where OperatingCost > 15

### Filters

In [37]:
stores.head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3


In [38]:
# Q1: Get records from stores where Location is Delhi
stores[stores.Location == 'Delhi']

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3
5,STR106,Apparel Zone,Apparel,Delhi,18.1,225.0,105,2.76,3.46,20.22,1,0,3,1,206.9
6,STR107,Fashion Bazar,Apparel,Delhi,14.3,360.0,245,3.21,3.57,15.84,0,0,3,4,345.7
7,STR108,Digital Bazar,Electronincs,Delhi,24.4,146.7,62,3.69,3.19,20.0,1,0,4,2,122.3


In [39]:
# Q2: Records from Kolkata where TotalSales > 100 and < 300
stores[(stores.Location=="Kolkata") & (stores.TotalSales > 100) & (stores['TotalSales'] < 300)]

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
26,STR127,Super Bazar,Super Market,Kolkata,26.0,120.3,91,4.43,2.14,16.7,0,1,5,2,94.3
29,STR130,Apparel Zone,Apparel,Kolkata,19.7,145.0,175,3.62,2.77,15.5,0,1,5,4,125.3
31,STR132,Digital Bazar,Electronincs,Kolkata,21.4,121.0,109,4.11,2.78,18.6,1,1,4,2,99.6


In [40]:
# Q3: All store codes and store types from Chennai where OperatingCost > 15
stores.loc[(stores.Location  == 'Chennai') & (stores.OperatingCost>15),['Location','StoreCode','StoreType']]

Unnamed: 0,Location,StoreCode,StoreType
8,Chennai,STR109,Electronincs
9,Chennai,STR110,Apparel
10,Chennai,STR111,Super Market
11,Chennai,STR112,Super Market
12,Chennai,STR113,Super Market
13,Chennai,STR114,Apparel
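
Two other common ways to write such filters are `isin` (match any value in a list) and `query` (the condition written as a string). A minimal sketch on a small hypothetical DataFrame, offered as an alternative style rather than a replacement for the boolean masks above:

```python
import pandas as pd

# Hypothetical miniature stand-in for the stores data
df = pd.DataFrame({
    "StoreCode": ["STR101", "STR109", "STR127", "STR129"],
    "Location": ["Delhi", "Chennai", "Kolkata", "Kolkata"],
    "TotalSales": [160.0, 140.8, 120.3, 351.0],
})

# isin: keep rows whose Location matches any value in the list
delhi_or_chennai = df[df["Location"].isin(["Delhi", "Chennai"])]

# query: the same kind of condition as Q2, written as a string expression
kolkata_mid = df.query(
    "Location == 'Kolkata' and TotalSales > 100 and TotalSales < 300"
)

print(delhi_or_chennai)
print(kolkata_mid)
```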


### Sorting
> - sort data by one column  [Location asc]
> - sort data by one column  [Location desc]
> - sort data by one column  [TotalSales asc]
> - sort data by two columns [Location, OperatingCost desc]
> - sort data by two columns [Location, TotalSales desc]
> - sort data by two columns [Location asc, TotalSales desc]

In [41]:
stores.sort_values('Location', inplace=True)
stores.head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
15,STR116,Digital Bazar,Electronincs,Chennai,10.4,460.0,215,3.0,5.424,17.82,0,0,3,4,449.6
14,STR115,Fashion Bazar,Apparel,Chennai,10.4,472.0,205,2.93,5.25,17.98,0,0,3,4,461.6
13,STR114,Apparel Zone,Apparel,Chennai,15.2,275.8,180,,3.78,18.0,0,0,3,3,260.6
12,STR113,Central Store,Super Market,Chennai,17.3,275.8,180,,3.73,17.6,0,0,3,3,258.5
11,STR112,Super Market,Super Market,Chennai,16.4,275.8,180,,4.07,17.4,0,0,3,3,259.4


In [42]:
stores.sort_index()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
0,STR101,Electronics Zone,Electronincs,Delhi,21.0,160.0,110,3.9,2.62,16.46,0,1,4,4,139.0
1,STR102,Apparel Zone,Apparel,Delhi,21.0,160.0,110,3.9,2.875,17.02,0,1,4,4,139.0
2,STR103,Super Bazar,Super Market,Delhi,22.8,108.0,93,3.85,2.32,18.61,1,1,4,1,85.2
3,STR104,Super Market,Super Market,Delhi,21.4,258.0,110,3.08,3.215,19.44,1,0,3,1,236.6
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3
5,STR106,Apparel Zone,Apparel,Delhi,18.1,225.0,105,2.76,3.46,20.22,1,0,3,1,206.9
6,STR107,Fashion Bazar,Apparel,Delhi,14.3,360.0,245,3.21,3.57,15.84,0,0,3,4,345.7
7,STR108,Digital Bazar,Electronincs,Delhi,24.4,146.7,62,3.69,3.19,20.0,1,0,4,2,122.3
8,STR109,Electronics Zone,Electronincs,Chennai,22.8,140.8,95,3.92,3.15,22.9,1,0,4,2,118.0
9,STR110,Apparel Zone,Apparel,Chennai,19.2,167.6,123,3.92,3.44,18.3,1,0,4,4,148.4


In [43]:
stores.sort_values('Location',ascending=False)

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
20,STR121,Central Store,Super Market,Mumbai,21.5,120.1,97,3.7,2.465,20.01,1,0,3,1,98.6
22,STR123,Fashion Bazar,Apparel,Mumbai,15.2,304.0,150,3.15,3.435,17.3,0,0,3,2,288.8
16,STR117,Electronics Zone,Electronincs,Mumbai,14.7,440.0,230,3.23,5.345,17.42,0,0,3,4,425.3
17,STR118,Apparel Zone,Apparel,Mumbai,32.4,78.7,66,4.08,2.2,19.47,1,1,4,1,46.3
18,STR119,Super Bazar,Super Market,Mumbai,30.4,75.7,52,4.93,1.615,18.52,1,1,4,2,45.3
19,STR120,Super Market,Super Market,Mumbai,33.9,71.1,65,4.22,1.835,19.9,1,1,4,1,37.2
21,STR122,Apparel Zone,Apparel,Mumbai,15.5,318.0,150,2.76,3.52,16.87,0,0,3,2,302.5
23,STR124,Digital Bazar,Electronincs,Mumbai,13.3,350.0,245,3.73,3.84,15.41,0,0,3,4,336.7
28,STR129,Central Store,Super Market,Kolkata,15.8,351.0,264,4.22,3.17,14.5,0,1,5,4,335.2
30,STR131,Fashion Bazar,Apparel,Kolkata,15.0,301.0,335,3.54,3.57,14.6,0,1,5,4,286.0


In [44]:
stores.sort_values("TotalSales", ascending = True).head()

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
19,STR120,Super Market,Super Market,Mumbai,33.9,71.1,65,4.22,1.835,19.9,1,1,4,1,37.2
18,STR119,Super Bazar,Super Market,Mumbai,30.4,75.7,52,4.93,1.615,18.52,1,1,4,2,45.3
17,STR118,Apparel Zone,Apparel,Mumbai,32.4,78.7,66,4.08,2.2,19.47,1,1,4,1,46.3
25,STR126,Apparel Zone,Apparel,Kolkata,27.3,79.0,66,4.08,1.935,18.9,1,1,4,1,51.7
27,STR128,Super Market,Super Market,Kolkata,30.4,95.1,113,3.77,1.513,16.9,1,1,5,2,64.7


In [45]:
stores.sort_values(['Location','OperatingCost'], ascending=False)

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
19,STR120,Super Market,Super Market,Mumbai,33.9,71.1,65,4.22,1.835,19.9,1,1,4,1,37.2
17,STR118,Apparel Zone,Apparel,Mumbai,32.4,78.7,66,4.08,2.2,19.47,1,1,4,1,46.3
18,STR119,Super Bazar,Super Market,Mumbai,30.4,75.7,52,4.93,1.615,18.52,1,1,4,2,45.3
20,STR121,Central Store,Super Market,Mumbai,21.5,120.1,97,3.7,2.465,20.01,1,0,3,1,98.6
21,STR122,Apparel Zone,Apparel,Mumbai,15.5,318.0,150,2.76,3.52,16.87,0,0,3,2,302.5
22,STR123,Fashion Bazar,Apparel,Mumbai,15.2,304.0,150,3.15,3.435,17.3,0,0,3,2,288.8
16,STR117,Electronics Zone,Electronincs,Mumbai,14.7,440.0,230,3.23,5.345,17.42,0,0,3,4,425.3
23,STR124,Digital Bazar,Electronincs,Mumbai,13.3,350.0,245,3.73,3.84,15.41,0,0,3,4,336.7
27,STR128,Super Market,Super Market,Kolkata,30.4,95.1,113,3.77,1.513,16.9,1,1,5,2,64.7
25,STR126,Apparel Zone,Apparel,Kolkata,27.3,79.0,66,4.08,1.935,18.9,1,1,4,1,51.7


In [46]:
stores.sort_values(["Location", "TotalSales"], ascending=False)

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
16,STR117,Electronics Zone,Electronincs,Mumbai,14.7,440.0,230,3.23,5.345,17.42,0,0,3,4,425.3
23,STR124,Digital Bazar,Electronincs,Mumbai,13.3,350.0,245,3.73,3.84,15.41,0,0,3,4,336.7
21,STR122,Apparel Zone,Apparel,Mumbai,15.5,318.0,150,2.76,3.52,16.87,0,0,3,2,302.5
22,STR123,Fashion Bazar,Apparel,Mumbai,15.2,304.0,150,3.15,3.435,17.3,0,0,3,2,288.8
20,STR121,Central Store,Super Market,Mumbai,21.5,120.1,97,3.7,2.465,20.01,1,0,3,1,98.6
17,STR118,Apparel Zone,Apparel,Mumbai,32.4,78.7,66,4.08,2.2,19.47,1,1,4,1,46.3
18,STR119,Super Bazar,Super Market,Mumbai,30.4,75.7,52,4.93,1.615,18.52,1,1,4,2,45.3
19,STR120,Super Market,Super Market,Mumbai,33.9,71.1,65,4.22,1.835,19.9,1,1,4,1,37.2
24,STR125,Electronics Zone,Electronincs,Kolkata,19.2,400.0,175,3.08,3.845,17.05,0,0,3,2,380.8
28,STR129,Central Store,Super Market,Kolkata,15.8,351.0,264,4.22,3.17,14.5,0,1,5,4,335.2


In [47]:
stores.sort_values(["Location", "TotalSales"], ascending = [True, False]).head(10)

Unnamed: 0,StoreCode,StoreName,StoreType,Location,OperatingCost,TotalSales,Total_Customers,AcqCostPercust,BasketSize,ProfitPercust,OwnStore,OnlinePresence,Tenure,StoreSegment,profit2
14,STR115,Fashion Bazar,Apparel,Chennai,10.4,472.0,205,2.93,5.25,17.98,0,0,3,4,461.6
15,STR116,Digital Bazar,Electronincs,Chennai,10.4,460.0,215,3.0,5.424,17.82,0,0,3,4,449.6
13,STR114,Apparel Zone,Apparel,Chennai,15.2,275.8,180,,3.78,18.0,0,0,3,3,260.6
12,STR113,Central Store,Super Market,Chennai,17.3,275.8,180,,3.73,17.6,0,0,3,3,258.5
11,STR112,Super Market,Super Market,Chennai,16.4,275.8,180,,4.07,17.4,0,0,3,3,259.4
10,STR111,Super Bazar,Super Market,Chennai,17.8,167.6,123,3.92,3.44,18.9,1,0,4,4,149.8
9,STR110,Apparel Zone,Apparel,Chennai,19.2,167.6,123,3.92,3.44,18.3,1,0,4,4,148.4
8,STR109,Electronics Zone,Electronincs,Chennai,22.8,140.8,95,3.92,3.15,22.9,1,0,4,2,118.0
6,STR107,Fashion Bazar,Apparel,Delhi,14.3,360.0,245,3.21,3.57,15.84,0,0,3,4,345.7
4,STR105,Central Store,Super Market,Delhi,18.7,360.0,175,3.15,3.44,17.02,0,0,3,2,341.3


### Remove Duplicates

In [55]:
score = pd.read_csv("Data Sets/Score.csv")
score

Unnamed: 0,Student,Section,Test1,Test2,Final
0,Capalleti,1,94,91,87
1,Dubose,2,51,65,91
2,Engles,1,95,97,97
3,Grant,2,63,75,80
4,Krupski,2,80,76,71
5,Lundsford,1,92,40,86
6,Mcbane,1,75,78,72
7,Capalleti,1,94,65,87
8,Dubose,2,51,65,91
9,Engles,1,95,97,97


In [52]:
#score['isdup'] = score.duplicated()          # flag rows that repeat an earlier row across all columns
score['isdup'] = score.duplicated(['Student']) # flag rows that repeat an earlier value in the given column(s)
#score['isdup'] = score.Student.duplicated()  # same check, called on the column itself
score
#score.duplicated?

Unnamed: 0,Student,Section,Test1,Test2,Final,isdup
0,Capalleti,1,94,91,87,False
1,Dubose,2,51,65,91,False
2,Engles,1,95,97,97,False
3,Grant,2,63,75,80,False
4,Krupski,2,80,76,71,False
5,Lundsford,1,92,40,86,False
6,Mcbane,1,75,78,72,False
7,Capalleti,1,94,65,87,True
8,Dubose,2,51,65,91,True
9,Engles,1,95,97,97,True


In [57]:
score[score.duplicated()]

Unnamed: 0,Student,Section,Test1,Test2,Final
8,Dubose,2,51,65,91
9,Engles,1,95,97,97
10,Grant,2,63,75,80
11,Krupski,2,80,76,71
12,Lundsford,1,92,40,86
13,Mcbane,1,75,78,72


In [58]:
score.loc[score.duplicated(),:]

Unnamed: 0,Student,Section,Test1,Test2,Final
8,Dubose,2,51,65,91
9,Engles,1,95,97,97
10,Grant,2,63,75,80
11,Krupski,2,80,76,71
12,Lundsford,1,92,40,86
13,Mcbane,1,75,78,72


In [61]:
score.loc[~score.Student.duplicated(),:]   # ~ negates the boolean mask, keeping the first record per Student

Unnamed: 0,Student,Section,Test1,Test2,Final
0,Capalleti,1,94,91,87
1,Dubose,2,51,65,91
2,Engles,1,95,97,97
3,Grant,2,63,75,80
4,Krupski,2,80,76,71
5,Lundsford,1,92,40,86
6,Mcbane,1,75,78,72
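
The cells above locate duplicates with `duplicated()`; pandas also provides `drop_duplicates()`, which removes them in one step. A minimal sketch on a small hypothetical DataFrame (a made-up stand-in for the score data):

```python
import pandas as pd

# Hypothetical miniature stand-in for the score data
df = pd.DataFrame({
    "Student": ["Capalleti", "Dubose", "Capalleti", "Dubose"],
    "Test2": [91, 65, 65, 65],
    "Final": [87, 91, 87, 91],
})

# Drop rows that repeat an earlier row across all columns
no_exact_dups = df.drop_duplicates()

# Drop rows that repeat a Student, keeping the last occurrence of each
last_per_student = df.drop_duplicates(subset=["Student"], keep="last")

print(no_exact_dups)
print(last_per_student)
```

Like `drop()`, `drop_duplicates()` returns a new DataFrame unless the result is assigned back.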


### Missing Data in Numeric Variables

    Options for handling missing values:
    - Drop the rows where data is missing (if you have LOTS of data, >1 mn rows)
    - Fill with the mean (if the distribution is symmetric)
    - Fill with the median (if the distribution is skewed)
    - Fill with zeros (for data where a missing value indicates absence of a metric)
    - Use predictive modelling techniques to estimate the approximate value of the missing data
    
### Missing Data in Object Variables
Fill missing values with the mode (the most frequent value).
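
The fill strategies above can be sketched with `fillna`. The example uses a small hypothetical DataFrame with one numeric and one object column; the choice of median for the numeric column is only an illustration, and the right statistic depends on the distribution as described above.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "AcqCostPercust": [3.9, None, 3.1, 5.0],
    "StoreType": ["Apparel", None, "Apparel", "Super Market"],
})

# Numeric column: fill with the median (use .mean() for symmetric data)
median_cost = df["AcqCostPercust"].median()
df["AcqCostPercust"] = df["AcqCostPercust"].fillna(median_cost)

# Object column: mode() returns a Series because ties are possible,
# so take its first entry as the fill value
most_common = df["StoreType"].mode().iloc[0]
df["StoreType"] = df["StoreType"].fillna(most_common)

print(df)
```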

### Connect to the RDBMS Servers

In [62]:
import pyodbc

In [None]:
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=SBAL;DATABASE=db_mar_19;Trusted_Connection=yes') 

In [None]:
query = 'SELECT * FROM TBL_CUSTOMER;'
df = pd.read_sql(query, sql_conn)
#df.head()
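
The cells above need a reachable SQL Server instance and an installed ODBC driver. To try the same `pd.read_sql` pattern without either, the sketch below swaps in Python's built-in `sqlite3` module with an in-memory database. The table contents and the column names `id` and `name` are hypothetical; only the table name `TBL_CUSTOMER` is taken from the query above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the SQL Server connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TBL_CUSTOMER (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO TBL_CUSTOMER VALUES (?, ?)",
    [(1, "John"), (2, "Mac")],  # made-up rows for illustration
)
conn.commit()

# Same call shape as above: a SQL string plus an open connection
query = "SELECT * FROM TBL_CUSTOMER;"
df = pd.read_sql(query, conn)
conn.close()

print(df)
```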