# Pandas

# Series

A Series is a one-dimensional, array-like labeled structure used to represent a dataset. It can be visualized as a single column of a table. It supports multiple data types, such as integer, string, and float.

A Series can be created from several kinds of data input: an ndarray, a dict, a scalar, or a list. Let’s take a look at each one in detail.

First, import NumPy and Pandas.

In [1]:
import numpy as np
import pandas as pd

### List

A list is a basic Python data structure that can act as input to create a Pandas series. A list can hold a range of values of multiple data types. So, if a dataset is available as a list, pass the list as input to create the series.


In [2]:
# Pass list as an argument
first_series = pd.Series(list('abcdef'))


In [3]:
print(first_series)

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object


The output shows the index, the data values, and the data type.

We did not specify an index for the data, so Pandas automatically assigned a default integer index starting at 0.
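
If the default integer index is not wanted, labels can be supplied when the series is created. The snippet below is a minimal sketch of that; the name `labeled_series` and the labels are made up for illustration.

```python
import pandas as pd

# Same kind of list input, but with explicit (hypothetical) index labels
labeled_series = pd.Series(list('abc'), index=['x', 'y', 'z'])

print(labeled_series['y'])            # b
print(labeled_series.index.tolist())  # ['x', 'y', 'z']
```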

### ndarray
    
An ndarray can be used as input to create a Pandas series. An ndarray is recommended wherever the dataset is number-centric and requires complex numerical computing.



In [5]:
# ndarray for countries
np_countries = np.array(['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus','Belgium',
                      'Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia','Cyprus','Denmark'])

In [6]:
# Pass ndarray as an argument
s_countries = pd.Series(np_countries)

In [7]:
print (s_countries[:5])

0      Algeria
1       Angola
2    Argentina
3    Australia
4      Austria
dtype: object
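
The country names are strings, so the series above has dtype object. For the number-centric case, a sketch along the following lines shows that the numeric dtype of the ndarray carries over to the series. The names `np_values` and `s_values` and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, chosen only for illustration
np_values = np.array([1.5, 2.5, 3.5, 4.5])
s_values = pd.Series(np_values)

print(s_values.dtype)            # float64, taken over from the ndarray
print(s_values.sum())            # 12.0
print((s_values * 2).tolist())   # [3.0, 5.0, 7.0, 9.0]
```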


### dict

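A Pandas series can also be created from a dictionary, which is very efficient for indexing or reindexing a dataset for data wrangling purposes. A dict stores key-value pairs, so use it whenever the dataset is structured as key-value pairs: the keys become the index labels and the values become the data. The example below builds the same structure by passing the values as a list and the keys through the ‘index’ argument.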


In [8]:
# gdp per capita values, with the country names passed as the index
country_gdp = pd.Series([2255.225482,629.9553062,11601.63022,25306.82494,27266.40335,19466.99052,588.3691778,2890.345675,
                           24733.62696,1445.760002,4803.398244,2618.876037,590.4521124,665.7982328,7122.938458,2639.54156,
                           3362.4656,15378.16704,30860.12808], 
                             index = ['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus',
                                      'Belgium','Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia',
                                      'Cyprus','Denmark'])


In [9]:
print (country_gdp)

Algeria        2255.225482
Angola          629.955306
Argentina     11601.630220
Australia     25306.824940
Austria       27266.403350
Bahamas       19466.990520
Bangladesh      588.369178
Belarus        2890.345675
Belgium       24733.626960
Bhutan         1445.760002
Brazil         4803.398244
Bulgaria       2618.876037
Cambodia        590.452112
Cameroon        665.798233
Chile          7122.938458
China          2639.541560
Colombia       3362.465600
Cyprus        15378.167040
Denmark       30860.128080
dtype: float64
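
The cell above passes a list of values plus an index. As a minimal sketch of the dict route itself, the snippet below passes a real Python dict to `pd.Series`. It reuses three of the values from the example; `gdp_dict` and `gdp_from_dict` are illustrative names.

```python
import pandas as pd

# Three of the gdp values from the example, written as key-value pairs
gdp_dict = {'Algeria': 2255.225482, 'Angola': 629.9553062, 'Argentina': 11601.63022}
gdp_from_dict = pd.Series(gdp_dict)

# The dict keys become the index labels
print(gdp_from_dict.index.tolist())  # ['Algeria', 'Angola', 'Argentina']
print(gdp_from_dict['Angola'])       # 629.9553062
```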


### Scalar

A scalar is another possible input for creating a Series. A scalar is a single stand-alone value; when it is passed together with an index, the value is repeated for every index label.



In [10]:
scalar_series = pd.Series(5., index=['a','b','c','d','e'])

In [11]:
print (scalar_series)

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64


## Accessing Elements in Series

In [12]:
# gdp per capita values, with the country names passed as the index
country_gdp = pd.Series([2255.225482,629.9553062,11601.63022,25306.82494,27266.40335,19466.99052,588.3691778,2890.345675,
                           24733.62696,1445.760002,4803.398244,2618.876037,590.4521124,665.7982328,7122.938458,2639.54156,
                           3362.4656,15378.16704,30860.12808], 
                             index = ['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus',
                                      'Belgium','Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia',
                                      'Cyprus','Denmark'])


In [13]:
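# First element by position; plain [0] on a label-indexed Series is deprecated in recent Pandas versions
country_gdp.iloc[0]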

2255.225482

In [14]:
country_gdp[0:5] 

Algeria       2255.225482
Angola         629.955306
Argentina    11601.630220
Australia    25306.824940
Austria      27266.403350
dtype: float64

In [15]:
# To look up data by label, use ‘loc’. Passing a country name returns the gdp per capita for that country;
# passing a slice of names returns every row between the two labels, with both endpoints included.

country_gdp.loc['Bulgaria': 'Denmark'] #lookup by name

Bulgaria     2618.876037
Cambodia      590.452112
Cameroon      665.798233
Chile        7122.938458
China        2639.541560
Colombia     3362.465600
Cyprus      15378.167040
Denmark     30860.128080
dtype: float64

In [16]:
country_gdp.loc["Bahamas":'Denmark']

Bahamas       19466.990520
Bangladesh      588.369178
Belarus        2890.345675
Belgium       24733.626960
Bhutan         1445.760002
Brazil         4803.398244
Bulgaria       2618.876037
Cambodia        590.452112
Cameroon        665.798233
Chile          7122.938458
China          2639.541560
Colombia       3362.465600
Cyprus        15378.167040
Denmark       30860.128080
dtype: float64

In [19]:
# One can also access data elements by index position with ‘iloc’. Here, ‘country_gdp.iloc[15]’ returns the
# value at position 15 (the sixteenth element), which is the gdp per capita for China.

country_gdp.iloc[15] #lookup by position

2639.54156
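
One difference between the two lookup styles is easy to miss: a label slice with ‘loc’ includes its end label, while a position slice with ‘iloc’ excludes its end position. The sketch below illustrates this on a shortened version of the gdp series; `gdp`, `by_label`, and `by_position` are names introduced here.

```python
import pandas as pd

# First four entries of the gdp series used above
gdp = pd.Series([2255.225482, 629.9553062, 11601.63022, 25306.82494],
                index=['Algeria', 'Angola', 'Argentina', 'Australia'])

by_label = gdp.loc['Algeria':'Argentina']  # label slice: 'Argentina' is included
by_position = gdp.iloc[0:2]                # position slice: position 2 is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```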

## Vectorized operations

Vectorized operations apply an operation, such as addition, to whole series at once. The operation is performed between elements that share the same index label, not between elements at the same position.

The first example shows how two series, ‘first_vector_series’ and ‘second_vector_series’, are added. Both have the same index, so the values are added label by label.


In [17]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['a','b','c','d'])

In [18]:
print (first_vector_series + second_vector_series)

a    11
b    22
c    33
d    44
dtype: int64


Let’s shuffle the indices and see what happens. For the second vector series, we change the order of the index labels to c, a, d, and b. When we add the two series, we get a different output because each data element is bound to its index label, not to its position.




In [19]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['c','a','d','b'])

In [20]:
print (first_vector_series + second_vector_series)

a    21
b    42
c    13
d    34
dtype: int64


In [21]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['a','b','e','f'])

In [22]:
print (first_vector_series+second_vector_series)

a    11.0
b    22.0
c     NaN
d     NaN
e     NaN
f     NaN
dtype: float64
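
In this last case, the labels ‘c’, ‘d’, ‘e’, and ‘f’ each appear in only one of the two series, so there is nothing to add them to and the result is NaN; the dtype changes to float64 to hold NaN. If a missing label should count as 0 instead, one option is the `add` method with `fill_value`, sketched below (`filled_sum` is an illustrative name).

```python
import pandas as pd

first_vector_series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
second_vector_series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# Treat a label missing from one series as 0 instead of producing NaN
filled_sum = first_vector_series.add(second_vector_series, fill_value=0)

print(filled_sum.tolist())  # [11.0, 22.0, 3.0, 4.0, 30.0, 40.0]
```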


# DataFrames

DataFrame is the other core data structure in Pandas.

DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.

A DataFrame looks like a spreadsheet or a SQL table, with rows and columns.

Several kinds of input can be used to create a DataFrame. Let’s go through them:

## dict

A Pandas DataFrame can be created from a dictionary of lists. It is very efficient when it comes to indexing or reindexing a dataset for data wrangling purposes.

In this example, we will create a dataset related to the Summer Olympics.

First, import the Pandas library. Then, declare a dict ‘olympic_data’ whose keys ‘HostCity’, ‘Year’, and ‘No of Participating Countries’ become the column names, each with a list of data elements as its value.

Next, pass this dict to the DataFrame constructor ‘pd.DataFrame’ to create a basic DataFrame.

Note that data alignment is taken care of automatically. When we display the DataFrame ‘df_olympic_data’, the output is a tabular representation of the data, showing all the rows with their corresponding indices.

In [23]:
olympic_data = {'HostCity':['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], 'Year': [2012,2008,2004,2000,1996],
                    'No of Participating Countries': [205,205,201,200,197]}
type(olympic_data)

dict

In [24]:
df_olympic_data = pd.DataFrame(olympic_data)

In [25]:
df_olympic_data

   HostCity  Year  No of Participating Countries
0    London  2012                            205
1   Beijing  2008                            205
2    Athens  2004                            201
3    Sydney  2000                            200
4   Atlanta  1996                            197


In [26]:
# View a column

df_olympic_data.HostCity

0     London
1    Beijing
2     Athens
3     Sydney
4    Atlanta
Name: HostCity, dtype: object

In [27]:
df_olympic_data[['HostCity']]

   HostCity
0    London
1   Beijing
2    Athens
3    Sydney
4   Atlanta
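
The two column views above return different types: a single column name gives a Series, while a list of column names gives a DataFrame. The sketch below makes this visible and also shows that a column name containing spaces has to be accessed with brackets rather than attribute access. The names `single`, `frame`, and `participants` are illustrative.

```python
import pandas as pd

df_olympic_data = pd.DataFrame({'HostCity': ['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'],
                                'Year': [2012, 2008, 2004, 2000, 1996],
                                'No of Participating Countries': [205, 205, 201, 200, 197]})

single = df_olympic_data['HostCity']    # one name -> Series
frame = df_olympic_data[['HostCity']]   # list of names -> one-column DataFrame
participants = df_olympic_data['No of Participating Countries']  # spaces require brackets

print(type(single).__name__)   # Series
print(type(frame).__name__)    # DataFrame
print(frame.shape)             # (5, 1)
print(participants.tolist())   # [205, 205, 201, 200, 197]
```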


## dict of Series

A dict of series can also be an input to a DataFrame.

Let’s learn how to create a DataFrame from a dict of series.

Let’s create two series first. The first series, ‘olympic_series_participation’, holds the number of countries participating in the given year. The second series, ‘olympic_series_countries’, holds the cities that hosted the Olympics that year. Both are indexed by year.
Now, create a DataFrame ‘df_olympic_series’ and pass both series to it in a dict. The dict keys become the column names, and the shared year index becomes the row index.


In [29]:
olympic_series_participation = pd.Series([205,205,201,200,197], index = [2012,2008,2004,2000,1996])
olympic_series_countries = pd.Series(['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], index = [2012,2008,2004,2000,1996])


In [30]:
df_olympic_series = pd.DataFrame({'No of Participating Countries':olympic_series_participation, 
                                  'HostCity': olympic_series_countries })

In [31]:
df_olympic_series

      No of Participating Countries  HostCity
2012                            205    London
2008                            205   Beijing
2004                            201    Athens
2000                            200    Sydney
1996                            197   Atlanta
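
In the example above both series share the same index. When the indices differ, the DataFrame uses the union of the labels and fills the gaps with NaN. The sketch below shows this with two shortened, deliberately mismatched series; `participation`, `host_city`, and `df_partial` are names made up for this illustration.

```python
import pandas as pd

# 2004 is missing from host_city, 2000 is missing from participation
participation = pd.Series([205, 205, 201], index=[2012, 2008, 2004])
host_city = pd.Series(['London', 'Beijing', 'Sydney'], index=[2012, 2008, 2000])

df_partial = pd.DataFrame({'No of Participating Countries': participation,
                           'HostCity': host_city})

print(df_partial.shape)                 # (4, 2)
print(df_partial.loc[2012, 'HostCity']) # London
print(df_partial.isna().sum().tolist()) # [1, 1], one gap per column
```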


## ndarray
    
An ndarray can be used as input to create a Pandas DataFrame. An ndarray is recommended wherever the dataset is number-centric and requires complex numerical computing.


In [32]:
# Create an ndarray of years and wrap it in a dict

np_array = np.array([2012,2008,2004,2006])
dict_ndarray = {'year':np_array}

In [33]:
# Pass this dict to a new DataFrame

df_ndarray = pd.DataFrame(dict_ndarray)


In [34]:
# Display the DataFrame

df_ndarray

   year
0  2012
1  2008
2  2004
3  2006
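
A two-dimensional ndarray can also be passed to the constructor directly, with the column names given through the `columns` argument. The following is a small sketch under that assumption, reusing a few values from the Olympic data; `np_2d` and `df_2d` are illustrative names.

```python
import numpy as np
import pandas as pd

# Each row is (year, number of participating countries)
np_2d = np.array([[2012, 205],
                  [2008, 205],
                  [2004, 201]])
df_2d = pd.DataFrame(np_2d, columns=['Year', 'No of Participating Countries'])

print(df_2d.shape)             # (3, 2)
print(df_2d['Year'].tolist())  # [2012, 2008, 2004]
```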


## DataFrame

You can also create a DataFrame from an existing DataFrame by passing it to the ‘pd.DataFrame’ constructor.

In [35]:
df_from_df = pd.DataFrame(df_olympic_series)
df_from_df

      No of Participating Countries  HostCity
2012                            205    London
2008                            205   Beijing
2004                            201    Athens
2000                            200    Sydney
1996                            197   Atlanta
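
Whether a DataFrame built this way shares its data with the original can depend on the Pandas version and settings. If an independent copy is wanted, one option is to request it explicitly with `copy=True`, as in this sketch (`df_source` and `df_copy` are illustrative names).

```python
import pandas as pd

df_source = pd.DataFrame({'HostCity': ['London', 'Beijing'], 'Year': [2012, 2008]})

# copy=True asks the constructor for an independent copy of the data
df_copy = pd.DataFrame(df_source, copy=True)
df_copy.loc[0, 'HostCity'] = 'Rio'

print(df_source.loc[0, 'HostCity'])  # London
print(df_copy.loc[0, 'HostCity'])    # Rio
```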


# Data Operation with Statistical Functions

In [45]:
df_test_scores = pd.DataFrame({'Test1': [95,84,73,88,82,61], 'Test2': [74,85,82,73,77,79]}, 
                              index = ['Jack','Lewis','Patrick','Rich','Kelly','Paula'])


In [46]:
df_test_scores.max()

Test1    95
Test2    85
dtype: int64

In [47]:
df_test_scores.min()

Test1    61
Test2    73
dtype: int64

In [48]:
df_test_scores.mean()

Test1    80.500000
Test2    78.333333
dtype: float64

In [49]:
df_test_scores.std()

Test1    11.979149
Test2     4.633213
dtype: float64