# Beginning Data Analysis

Book, page 87
Although there is no standard approach when beginning a data analysis, it is typically a
good idea to develop a routine for yourself when first examining a dataset. This
routine can manifest itself as a dynamic checklist of tasks that evolves as your familiarity
with pandas and data analysis expands.

Exploratory Data Analysis (EDA) is a term used to encompass the entire process of
analyzing data without the formal use of statistical testing procedures. Much of EDA
involves visually displaying different relationships among the data to detect interesting
patterns and develop hypotheses.
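One way to make such a routine concrete is a small helper that gathers the first facts you usually want about every column. This is only a sketch: `first_look` and the made-up `sample` frame are hypothetical names and data, not part of the college dataset.

```python
import numpy as np
import pandas as pd


def first_look(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-missing count, missing share, distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'missing_share': df.isna().mean(),
        'n_distinct': df.nunique(),  # missing values are not counted
    })


# Tiny stand-in for a real dataset (values are made up for illustration)
sample = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'state': ['AL', 'AL', 'CA', None],
    'score': [424.0, np.nan, 570.0, 595.0],
})

summary = first_look(sample)
print(sample.shape)
print(summary)
```

The checklist can grow over time, for example by adding memory usage or value ranges as extra columns of the summary.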

In [1]:
import pandas as pd
import numpy as np
#college = pd.read_csv('C:\Anaconda\data\college.csv')
college = pd.read_csv('https://raw.githubusercontent.com/DatasRev/source-files/master/csv/college.csv')
college.head(7)

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5
5,The University of Alabama,Tuscaloosa,AL,0.0,0.0,0.0,0,555.0,565.0,0.0,...,0.0261,0.0268,0.0026,0.0844,1,0.204,0.401,0.0853,41900,23750.0
6,Central Alabama Community College,Alexander City,AL,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0,0.0019,0.3882,1,0.5892,0.3977,0.3153,27500,16127.0


In [2]:
# Number of rows and columns
college.shape

(7535, 27)

In [3]:
# List the data type of each column, number of non-missing values, and memory usage with the info method:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM                7535 non-null object
CITY                  7535 non-null object
STABBR                7535 non-null object
HBCU                  7164 non-null float64
MENONLY               7164 non-null float64
WOMENONLY             7164 non-null float64
RELAFFIL              7535 non-null int64
SATVRMID              1185 non-null float64
SATMTMID              1196 non-null float64
DISTANCEONLY          7164 non-null float64
UGDS                  6874 non-null float64
UGDS_WHITE            6874 non-null float64
UGDS_BLACK            6874 non-null float64
UGDS_HISP             6874 non-null float64
UGDS_ASIAN            6874 non-null float64
UGDS_AIAN             6874 non-null float64
UGDS_NHPI             6874 non-null float64
UGDS_2MOR             6874 non-null float64
UGDS_NRA              6874 non-null float64
UGDS_UNKN             6874 non-null float64
PPTUG_EF          

Get summary statistics for the numerical columns and transpose
the DataFrame for more readable output:

In [4]:
college.describe(include=[np.number])

Unnamed: 0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV
count,7164.0,7164.0,7164.0,7535.0,1185.0,1196.0,7164.0,6874.0,6874.0,6874.0,...,6874.0,6874.0,6874.0,6874.0,6874.0,6853.0,7535.0,6849.0,6849.0,6718.0
mean,0.014238,0.009213,0.005304,0.190975,522.819409,530.76505,0.005583,2356.83794,0.510207,0.189997,...,0.013813,0.004569,0.02395,0.016086,0.045181,0.226639,0.923291,0.530643,0.522211,0.410021
std,0.118478,0.095546,0.072642,0.393096,68.578862,73.469767,0.074519,5474.275871,0.286958,0.224587,...,0.070196,0.033125,0.031288,0.050172,0.09344,0.24647,0.266146,0.225544,0.283616,0.228939
min,0.0,0.0,0.0,0.0,290.0,310.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,475.0,482.0,0.0,117.0,0.2675,0.036125,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.3578,0.3329,0.2415
50%,0.0,0.0,0.0,0.0,510.0,520.0,0.0,412.5,0.5557,0.10005,...,0.0026,0.0,0.0175,0.0,0.0143,0.1504,1.0,0.5215,0.5833,0.40075
75%,0.0,0.0,0.0,0.0,555.0,565.0,0.0,1929.5,0.747875,0.2577,...,0.0073,0.0025,0.0339,0.0117,0.0454,0.3769,1.0,0.7129,0.745,0.572275
max,1.0,1.0,1.0,1.0,765.0,785.0,1.0,151558.0,1.0,1.0,...,1.0,0.9983,0.5333,0.9286,0.9027,1.0,1.0,1.0,1.0,1.0


In [5]:
college.describe(include=[np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0


In [6]:
# Get summary statistics for the object and categorical columns:
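# np.object was removed in NumPy 1.24; the built-in object and the string 'category' select the same columns
college.describe(include=[object, 'category']).T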

Unnamed: 0,count,unique,top,freq
INSTNM,7535,7535,Dance Theatre of Harlem Inc,1
CITY,7535,2514,New York,87
STABBR,7535,59,CA,773
MD_EARN_WNE_P10,6413,598,PrivacySuppressed,822
GRAD_DEBT_MDN_SUPP,7503,2038,PrivacySuppressed,1510


Broadly speaking, we can classify data as being either continuous or
categorical. Continuous data is always numeric and can usually take on an
infinite number of possible values, such as height, weight, and salary.
Categorical data represents discrete values that take on a finite number of
possibilities, such as ethnicity, employment status, and car color.
Categorical data can be represented numerically or with characters.
Categorical columns usually have either the object data type or the pandas category data type.
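As a rough first pass, `select_dtypes` can separate numeric from non-numeric columns. The sketch below uses a made-up frame (all column names and values are assumptions) and shows why the split is only a starting point: a 0/1 flag is stored as a number although its meaning is categorical.

```python
import pandas as pd

# Made-up example data; only the dtypes matter here
df = pd.DataFrame({
    'height_cm': [170.2, 181.5, 165.0],                   # continuous
    'employment': ['employed', 'student', 'employed'],    # categorical, stored as text
    'car_color': pd.Categorical(['red', 'blue', 'red']),  # categorical, category dtype
    'is_member': [1, 0, 1],                               # categorical meaning, numeric storage
})

numeric_cols = df.select_dtypes(include='number').columns.tolist()
non_numeric_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```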

In [7]:
college.describe(include=[np.number],
percentiles=[.01, .05, .10, .25, .5,
.75, .9, .95, .99]).T

Unnamed: 0,count,mean,std,min,1%,5%,10%,25%,50%,75%,90%,95%,99%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,390.0,430.0,447.4,475.0,510.0,555.0,605.0,665.0,730.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,395.0,430.0,453.0,482.0,520.0,565.0,630.0,685.0,745.25,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,14.0,31.65,49.0,117.0,412.5,1929.5,6512.3,11858.05,26015.29,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.0,0.013265,0.06879,0.2675,0.5557,0.747875,0.86297,0.927315,1.0,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.0,0.0,0.00753,0.036125,0.10005,0.2577,0.51571,0.726715,0.961467,1.0


## Data dictionaries
A crucial part of a data analysis involves creating and maintaining a data dictionary. A data
dictionary is a table of metadata and notes on each column of data. One of the primary
purposes of a data dictionary is to explain the meaning of the column names. The college
dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is
inspecting it for the first time.

In [8]:
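# NOTE: this URL is the same college.csv loaded in the first cell, so the output
# below repeats the dataset rather than showing a separate data dictionary file.
collegedict = pd.read_csv('https://raw.githubusercontent.com/DatasRev/source-files/master/csv/college.csv')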
collegedict.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5
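A data dictionary can also start life as a small hand-written table. The sketch below is illustrative only: the descriptions are assumed expansions of a few of the college column names, and `descriptions`, `data_dict`, and `observed_columns` are hypothetical names introduced here.

```python
import pandas as pd

# Assumed meanings of a few abbreviations (verify against the official documentation)
descriptions = {
    'INSTNM': 'Institution name',
    'CITY': 'City',
    'STABBR': 'State abbreviation',
    'HBCU': 'Historically Black College or University (1 = yes)',
}

data_dict = pd.DataFrame({'description': pd.Series(descriptions)})
data_dict.index.name = 'column_name'
data_dict['notes'] = ''  # free-text notes collected during the analysis

# Columns that appear in the data but are not yet documented
observed_columns = ['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY']
undocumented = [c for c in observed_columns if c not in data_dict.index]

print(data_dict.loc['STABBR', 'description'])
print(undocumented)
```

Keeping the dictionary as a DataFrame makes it easy to save alongside the data and to check that every column has an entry.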


## Reducing memory by changing data types
This recipe changes the data type of one of the object columns from the college dataset to
the special pandas Categorical data type to drastically reduce its memory usage.

In [9]:
different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER',
                  'INSTNM', 'STABBR']
# Work on an explicit copy so that later column assignments do not affect college
col2 = college.loc[:, different_cols].copy()
col2.head()

Unnamed: 0,RELAFFIL,SATMTMID,CURROPER,INSTNM,STABBR
0,0,420.0,1,Alabama A & M University,AL
1,0,565.0,1,University of Alabama at Birmingham,AL
2,1,,1,Amridge University,AL
3,0,590.0,1,University of Alabama in Huntsville,AL
4,0,430.0,1,Alabama State University,AL


In [10]:
col2.dtypes

RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

In [11]:
# Find the memory usage of each column with the memory_usage method:
original_mem = col2.memory_usage(deep=True)
original_mem

Index           80
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

There is no need to use 64 bits for the RELAFFIL column as it contains only 0/1
values. Let's convert this column to an 8-bit (1 byte) integer with the astype
method:

In [12]:
col2['RELAFFIL'] = col2['RELAFFIL'].astype(np.int8)
col2.dtypes

RELAFFIL       int8
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object
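Before downcasting, it can help to confirm that the values fit into the smaller type. The following sketch uses a made-up 0/1 Series (`flags` is a hypothetical name) and shows both a manual range check with `np.iinfo` and the automatic alternative `pd.to_numeric(..., downcast='integer')`.

```python
import numpy as np
import pandas as pd

flags = pd.Series([0, 1, 1, 0, 1], dtype='int64')  # made-up 0/1 values

# Manual check: int8 can hold values from -128 to 127
info = np.iinfo(np.int8)
fits = info.min <= flags.min() and flags.max() <= info.max
small = flags.astype(np.int8) if fits else flags

# Automatic alternative: pandas picks the smallest integer type that fits
auto = pd.to_numeric(flags, downcast='integer')

print(flags.memory_usage(index=False), small.memory_usage(index=False))
print(auto.dtype)
```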

Find the memory usage of each column again and note the large reduction.
The pandas DataFrame.memory_usage() method returns the memory usage of each column in bytes. The result can optionally include
the contribution of the index and of the elements of object dtype columns.
Reference: http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.memory_usage.html

In [13]:
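# Measure col2, the DataFrame whose RELAFFIL column was converted;
# college[different_cols] would still report the original 64-bit column.
col2.memory_usage(deep=True)

Index           80
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64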

To save even more memory, consider changing object data types
to categorical if they have a reasonably low cardinality (number of distinct
values). A note on terminology: strictly speaking, unique values are those that occur
exactly once, while distinct values are the different values present, however often
each occurs. The pandas nunique method counts distinct values.
Let's first check this count for both object columns.
The STABBR column turns out to be a good candidate to convert to Categorical, as its
number of distinct values is less than one percent of its row count:

In [14]:
col2.select_dtypes(include=['object']).nunique()

INSTNM    7535
STABBR      59
dtype: int64
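The same check can be expressed as a ratio of distinct values to rows. This is a minimal sketch on made-up data; the 5% cutoff is an arbitrary assumption, not a pandas rule, and `candidates` is a hypothetical name.

```python
import pandas as pd

# Made-up data: one column with all-different values, one with only four values
df = pd.DataFrame({
    'name': [f'School {i}' for i in range(200)],
    'state': ['AL', 'CA', 'NY', 'TX'] * 50,
})

text_cols = df.select_dtypes(exclude='number')
ratio = text_cols.nunique() / len(df)  # share of distinct values per column

# Assumed threshold: treat columns below 5% distinct values as candidates
candidates = ratio[ratio < 0.05].index.tolist()

print(ratio)
print(candidates)
```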

In [15]:
col2['STABBR'] = col2['STABBR'].astype('category')
col2.dtypes

RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object

In [16]:
new_mem = col2.memory_usage(deep=True)
new_mem

Index           80
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660699
STABBR       13576
dtype: int64

Finally, let's compare the original memory usage with our updated memory
usage. The RELAFFIL column is, as expected, an eighth of its original size, while the
STABBR column has shrunk to just three percent of its original size. Taken together,
the five columns and the index now use about 62% of the memory they started with:

In [17]:
# Per-column ratio: new_mem / original_mem
# Total memory after the conversions, as a percentage of the original:
str(round(new_mem.sum() * 100 / original_mem.sum(), 2)) + '%'

'62.41%'
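The per-column ratios can be reproduced from the byte counts printed above. The sketch below re-enters those numbers by hand so that it runs without downloading the dataset.

```python
import pandas as pd

cols = ['Index', 'RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']

# Byte counts copied from the memory_usage outputs shown earlier
original_mem = pd.Series([80, 60280, 60280, 60280, 660240, 444565], index=cols)
new_mem = pd.Series([80, 7535, 60280, 60280, 660699, 13576], index=cols)

ratio = (new_mem / original_mem).round(3)  # share of original memory per column
total_pct = round(new_mem.sum() * 100 / original_mem.sum(), 2)

print(ratio)
print(total_pct)
```

RELAFFIL comes out at 0.125 (one eighth) and STABBR at roughly 0.031, which matches the description in the text.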

Pandas defaults integer and float data types to 64 bits regardless of the maximum
necessary size for the particular DataFrame. Integers, floats, and even booleans may be
coerced to a different data type with the astype method and passing it the exact type,
either as a string or specific object.
Object data types can have a mix of strings, numerics, datetimes, or even other
Python objects such as lists or tuples. For this reason, the object data type is sometimes
referred to as a catch-all for a column of data that doesn't match any of the other data types.
The vast majority of the time, though, object data type columns will all be strings.
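Both points can be checked on small made-up Series: a dtype may be given as a string or as a NumPy object, and an object column really can hold arbitrary Python values. The variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 1, 0])

# The dtype can be passed as a string or as the NumPy type itself
by_string = s.astype('int8')
by_object = s.astype(np.int8)

# An object column can mix strings, numbers, and other Python objects
mixed = pd.Series(['text', 3, 2.5, (1, 2)])
kinds = [type(v).__name__ for v in mixed]

print(by_string.dtype, by_object.dtype)
print(mixed.dtype, kinds)
```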

Therefore, the memory of each individual value in an object data type column is
inconsistent. There is no predefined amount of memory for each value like the other data
types. For pandas to extract the exact amount of memory of an object data type column, the
deep parameter must be set to True in the memory_usage method.
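The effect of the deep parameter is easy to see on a short object Series. In this sketch (made-up values, hypothetical names) the shallow figure counts only the pointers to the Python strings, while the deep figure also counts the strings themselves.

```python
import pandas as pd

names = pd.Series(
    ['Alabama A & M University', 'Amridge University', 'The University of Alabama'],
    dtype=object,
)

shallow = names.memory_usage(index=False)          # pointers only
deep = names.memory_usage(index=False, deep=True)  # pointers plus string objects

print(shallow, deep)
```

The exact deep value depends on the Python build, so only the comparison between the two numbers is meaningful here.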
Object columns are targets for the largest memory savings. Pandas has an additional
categorical data type that is not available in NumPy. When converting to category, pandas
internally creates a mapping from integers to each unique string value. Thus, each string
only needs to be kept a single time in memory. As you can see, this simple change of data
type reduced the memory usage of the STABBR column by 97%.
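The integer mapping can be inspected through the `.cat` accessor. The sketch below uses a handful of made-up state abbreviations; `states` and `cat` are hypothetical names.

```python
import pandas as pd

states = pd.Series(['AL', 'CA', 'AL', 'NY', 'CA', 'AL'])
cat = states.astype('category')

categories = cat.cat.categories.tolist()  # each distinct string, stored once
codes = cat.cat.codes.tolist()            # one small integer per row

print(categories)       # ['AL', 'CA', 'NY']
print(codes)            # [0, 1, 0, 2, 1, 0]
print(cat.cat.codes.dtype)
```

With only a few categories the codes fit into 8-bit integers, which is where most of the saving comes from.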

Python 3 uses Unicode, a standardized character representation intended
to encode all the world's writing systems.

Not all columns can be coerced to the desired type. Take a look at the MENONLY column,
which from the data dictionary appears to contain only 0/1 values. The actual data type of
this column upon import unexpectedly turns out to be float64. The reason is that the column
contains missing values, denoted by np.nan. NumPy integer types have no representation
for missing values, so by default any numeric column with even a single missing value is
stored as a float (pandas' nullable integer types, such as Int8, are an opt-in exception).
Furthermore, any column of an integer data type will automatically be coerced to a float if
one of the values becomes missing:
In [18]:
college['MENONLY'].dtype

dtype('float64')

In [19]:
#college['MENONLY'].astype(np.int8)
#Cannot convert non-finite values (NA or inf) to integer
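The coercion can be reproduced on a small made-up Series. This sketch also shows pandas' nullable `Int8` type as one possible way to keep integers alongside missing values; whether that suits a given analysis is a separate decision.

```python
import numpy as np
import pandas as pd

ints = pd.Series([0, 1, 0], dtype='int64')

# Reindexing to a label that does not exist introduces a missing value
with_gap = ints.reindex([0, 1, 2, 3])
print(with_gap.dtype)  # float64

# A plain NumPy integer type cannot hold the missing value
try:
    with_gap.astype(np.int8)
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable extension type stores the gap as <NA> instead
nullable = with_gap.astype('Int8')
print(cast_failed, nullable.dtype, nullable.isna().sum())
```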

In [20]:
# Additionally, it is possible to substitute string names in place of Python objects when referring to data types.
college.describe(include=['int64', 'float64']).T
college.describe(include=[np.int64, np.float64]).T
college.describe(include=['int', 'float']).T
college.describe(include=['number']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0


In [21]:
# Lastly, it is possible to see the enormous memory difference between the minimal
# RangeIndex and an integer index that stores every row label in memory.
# pd.Int64Index was removed in pandas 2.0; an Index built from a NumPy array is the equivalent.
college.index = pd.Index(college.index.to_numpy())
college.index.memory_usage() # previously was just 80

60280
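The comparison can be repeated without the dataset by building both index types for the same number of rows. In this sketch `lazy` and `materialized` are hypothetical names, and the exact size reported for a RangeIndex may differ between pandas versions.

```python
import numpy as np
import pandas as pd

n = 7535  # number of rows in the college dataset

lazy = pd.RangeIndex(n)                                # stores only start, stop, step
materialized = pd.Index(np.arange(n, dtype=np.int64))  # stores all n labels

lazy_bytes = lazy.memory_usage()
materialized_bytes = materialized.memory_usage()

print(lazy_bytes, materialized_bytes)
```

Eight bytes per label times 7,535 rows accounts for the 60,280 bytes shown above.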