## Create data frame

### Import

In [1]:
import pandas as pd

### Create data frame using dictionary

In [2]:
data = {
    'year': [2008, 2012, 2016],
    'attendees': [112, 321, 729],
    'average age': [24, 43, 333]
}
df = pd.DataFrame(data)
df

,year,attendees,average age
0,2008,112,24
1,2012,321,43
2,2016,729,333


### Append dataframe with dictionary

In [3]:
data2 = {
    'year': 2000,
    'attendees': 344,
    'average age': 'a'
}
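# New object! Not in place: df itself is left unchanged.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
new_df = pd.concat([df, pd.DataFrame([data2])], ignore_index=True)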
new_df

,year,attendees,average age
0,2008,112,24
1,2012,321,43
2,2016,729,333
3,2000,344,a
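
The appended row puts the string 'a' into a column that previously held only numbers, which changes that column's dtype. A minimal sketch of the effect, rebuilding the same frame and assuming a pandas version where `pd.concat` is the way to add rows (the variable name `row` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2008, 2012, 2016],
    'attendees': [112, 321, 729],
    'average age': [24, 43, 333],
})
row = {'year': 2000, 'attendees': 344, 'average age': 'a'}

# Wrap the dictionary in a list so it becomes a one-row DataFrame.
new_df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# 'year' and 'attendees' stay integer columns; 'average age' now mixes
# numbers and text, so it is no longer a numeric column.
print(new_df.dtypes)
```

Checking `dtypes` after adding rows is a quick way to spot a stray text value in numeric data.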


### Loading CSV data

In [4]:
csv_data = pd.read_csv('countries.csv')
print(type(csv_data))

<class 'pandas.core.frame.DataFrame'>
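
If countries.csv is not at hand, the same call can be tried on in-memory text. This is only a sketch: the three rows are copied from the previews further down in this notebook, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# A few rows in the same layout as countries.csv.
csv_text = """country,continent,year,lifeExpectancy,population,gdpPerCapita
Afghanistan,Asia,1952,28.801,8425333,779.445314
Afghanistan,Asia,1957,30.332,9240934,820.85303
Zimbabwe,Africa,2007,43.487,12311143,469.709298
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(type(sample))
print(sample.shape)
```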


## Preview and examine data in a Pandas DataFrame

You’ll notice that Pandas displays only 20 columns by default for wide dataframes, and only 60 or so rows, truncating the middle section. If you’d like to change these limits, you can edit the defaults using the Pandas display options (simply use pd.options.display.XX = value to set these):

pd.options.display.width – the width of the display in characters; use this if your display is wrapping rows over more than one line.

pd.options.display.max_rows – maximum number of rows displayed.

pd.options.display.max_columns – maximum number of columns displayed.
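
A short sketch of setting these display options; the values 100, 30 and 120 are arbitrary choices for illustration.

```python
import pandas as pd

# Set an option globally through the options attribute...
pd.options.display.max_rows = 100
# ...or through set_option, which takes the dotted option name as a string.
pd.set_option('display.max_columns', 30)

# option_context changes a setting only inside the with-block.
with pd.option_context('display.width', 120):
    width_inside = pd.get_option('display.width')

rows_now = pd.get_option('display.max_rows')
cols_now = pd.get_option('display.max_columns')

# Restore the defaults when done.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```

`option_context` is handy in a notebook when only one cell needs a wider or longer display.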


### Data shape

In [5]:
# Number of rows and columns.
csv_data.shape

(1704, 6)

### Preview data frames

In [6]:
csv_data.head(5)

,country,continent,year,lifeExpectancy,population,gdpPerCapita
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [7]:
csv_data.tail(5)

,country,continent,year,lifeExpectancy,population,gdpPerCapita
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


### Data types of columns

In [8]:
csv_data.dtypes

country            object
continent          object
year                int64
lifeExpectancy    float64
population          int64
gdpPerCapita      float64
dtype: object

In [9]:
# Convert a specific column to another datatype.
# astype returns a new Series; assign it back to the column to keep the change.
csv_data['country'].astype(str).head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object
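
Because `astype` returns a new Series, the conversion only sticks once it is assigned back. A minimal sketch on a small hand-made frame (`sample` is an illustrative name, not part of the notebook's data):

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'year': [1952, 1957],
})

# The conversion produces a new Series; the DataFrame column is untouched.
converted = sample['year'].astype(float)
still_int = sample['year'].dtype

# Assigning the result back replaces the column.
sample['year'] = converted
now_float = sample['year'].dtype
print(still_int, now_float)
```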

## Describing data

In [10]:
csv_data['lifeExpectancy'].describe()

count    1704.000000
mean       59.474439
std        12.917107
min        23.599000
25%        48.198000
50%        60.712500
75%        70.845500
max        82.603000
Name: lifeExpectancy, dtype: float64

In [11]:
csv_data.describe()

,year,lifeExpectancy,population,gdpPerCapita
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,59.474439,29601210.0,7215.327081
std,17.26533,12.917107,106157900.0,9857.454543
min,1952.0,23.599,60011.0,241.165877
25%,1965.75,48.198,2793664.0,1202.060309
50%,1979.5,60.7125,7023596.0,3531.846989
75%,1993.25,70.8455,19585220.0,9325.462346
max,2007.0,82.603,1318683000.0,113523.1329
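
By default `describe()` summarises only the numeric columns, which is why country and continent are missing above. As a sketch on a small hand-made frame, passing `include='all'` adds count, unique, top and freq for the text columns:

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe'],
    'continent': ['Asia', 'Asia', 'Africa'],
    'lifeExpectancy': [28.801, 30.332, 43.487],
})

# Default behaviour: numeric columns only.
numeric_summary = sample.describe()

# include='all' covers every column; statistics that do not apply are NaN.
full_summary = sample.describe(include='all')
print(full_summary)
```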


## Selecting and Manipulating Data

### Selecting columns

There are three main methods of selecting columns in pandas:

using dot notation, e.g. data.column_name (this only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute),

using square brackets and the name of the column as a string, e.g. data['column_name'],

using numeric indexing and the iloc selector, e.g. data.iloc[:, <column_number>]

In [12]:
csv_data.country.head() # Primary method.

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [13]:
csv_data['country'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [14]:
csv_data.iloc[:, 0].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [15]:
# Series summary operations.
# We are selecting the column .gdpPerCapita, and performing various calculations.
[csv_data.gdpPerCapita.sum(), # Total sum of the column values
 csv_data.gdpPerCapita.mean(), # Mean of the column values
 csv_data.gdpPerCapita.median(), # Median of the column values
 csv_data.gdpPerCapita.nunique(), # Number of unique entries
 csv_data.gdpPerCapita.max(), # Maximum of the column values
 csv_data.gdpPerCapita.min()] # Minimum of the column values

[12294917.346385501,
 7215.327081212149,
 3531.8469885000004,
 1704,
 113523.1329,
 241.16587650000002]

Selecting multiple columns at the same time extracts a new DataFrame from your existing DataFrame. For selection of multiple columns, the syntax is:

square-brace selection with a list of column names, e.g. 

data[['column_name_1', 'column_name_2']]

using numeric indexing with the iloc selector and a list of column numbers, e.g. 

data.iloc[:, [0,1,20,22]]
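
A minimal sketch of both forms on a small hand-made frame (`sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['Afghanistan', 'Zimbabwe'],
    'continent': ['Asia', 'Africa'],
    'year': [1952, 2007],
    'lifeExpectancy': [28.801, 43.487],
})

# A list of names inside the brackets returns a DataFrame with those columns.
by_name = sample[['country', 'year']]

# The same columns by position: all rows, columns 0 and 2.
by_position = sample.iloc[:, [0, 2]]

# A single name in double brackets still returns a DataFrame, not a Series.
one_column_frame = sample[['country']]
print(type(one_column_frame))
```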

### Selecting rows

The basic methods to get your head around are:

numeric row selection using the iloc selector, e.g. 

data.iloc[0:10, :] – select the first 10 rows.

label-based row selection using the loc selector (labels come from the dataframe's index; with the default index they are simply the row numbers), e.g. 

data.loc[44, :]

logical-based row selection using evaluated statements, e.g. 

data[data["Area"] == "Ireland"] – select the rows where Area value is ‘Ireland’.

In [16]:
csv_data[csv_data.country == 'Afghanistan'].head()

,country,continent,year,lifeExpectancy,population,gdpPerCapita
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [17]:
csv_data.loc[44, :]

country            Angola
continent          Africa
year                 1992
lifeExpectancy     40.647
population        8735988
gdpPerCapita      2627.85
Name: 44, dtype: object

In [18]:
csv_data.iloc[0:5, :]

,country,continent,year,lifeExpectancy,population,gdpPerCapita
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
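
The three row selectors can be compared side by side. A sketch on a small hand-made frame, showing a combined condition and how `loc` behaves once a column has been set as the index:

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
    'year': [1952, 1957, 2002, 2007],
    'lifeExpectancy': [28.801, 30.332, 39.989, 43.487],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
late_zimbabwe = sample[(sample['country'] == 'Zimbabwe') & (sample['year'] > 2002)]

# After set_index, loc selects by the new labels instead of the row numbers.
by_country = sample.set_index('country')
afghanistan_rows = by_country.loc['Afghanistan']

# iloc still selects by position, whatever the index is.
first_row = by_country.iloc[0]
print(afghanistan_rows)
```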


### Deleting rows and columns

In [19]:
# Deleting columns
# Delete the "country" column from the dataframe
csv_data = csv_data.drop("country", axis=1)
csv_data.head()

,continent,year,lifeExpectancy,population,gdpPerCapita
0,Asia,1952,28.801,8425333,779.445314
1,Asia,1957,30.332,9240934,820.85303
2,Asia,1962,31.997,10267083,853.10071
3,Asia,1967,34.02,11537966,836.197138
4,Asia,1972,36.088,13079460,739.981106


In [20]:
# alternatively, delete columns using the columns parameter of drop
csv_data = csv_data.drop(columns="year")
csv_data.head()

,continent,lifeExpectancy,population,gdpPerCapita
0,Asia,28.801,8425333,779.445314
1,Asia,30.332,9240934,820.85303
2,Asia,31.997,10267083,853.10071
3,Asia,34.02,11537966,836.197138
4,Asia,36.088,13079460,739.981106


In [21]:
# Delete the continent column from the dataframe in place
# Note that the original 'csv_data' object is changed when inplace=True
csv_data.drop("continent", axis=1, inplace=True)
csv_data.head()

,lifeExpectancy,population,gdpPerCapita
0,28.801,8425333,779.445314
1,30.332,9240934,820.85303
2,31.997,10267083,853.10071
3,34.02,11537966,836.197138
4,36.088,13079460,739.981106


In [22]:
# Delete multiple columns from the dataframe
csv_data = csv_data.drop(['population', 'gdpPerCapita'], axis=1)
csv_data.head()

,lifeExpectancy
0,28.801
1,30.332
2,31.997
3,34.02
4,36.088


In [23]:
# Delete rows.
csv_data.drop([0,1], axis=0).head()

,lifeExpectancy
2,31.997
3,34.02
4,36.088
5,38.438
6,39.854


In [24]:
data = pd.read_csv('countries.csv')

In [25]:
# Delete the rows with labels 0,1,2
data = data.drop([0,1,2], axis=0)

In [26]:
# Delete the rows with label "Afghanistan"
# For label-based deletion, set the index first on the dataframe:
data = data.set_index("country")
data = data.drop("Afghanistan", axis=0) # Delete all rows with label "Afghanistan"

In [27]:
# Delete the first five rows using iloc selector
data = data.iloc[5:, :]
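
Row deletion follows the same rule as column deletion: `drop` returns a new DataFrame unless the result is assigned back or `inplace=True` is used. A sketch on a small hand-made frame, which also shows a boolean mask as an alternative when rows are chosen by value:

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Angola'],
    'year': [1952, 1977, 1992],
})

# drop returns a new DataFrame; without assignment the original keeps all rows.
trimmed = sample.drop([0], axis=0)
rows_in_original = len(sample)

# Dropping a label that does not exist raises KeyError unless errors='ignore' is passed.
unchanged = sample.drop([99], axis=0, errors='ignore')

# Keep every row whose country is not Albania.
without_albania = sample[sample['country'] != 'Albania']
print(without_albania)
```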

### Renaming columns

In [28]:
# Rename columns using a dictionary to map values
# Rename the continent column to 'place_name'
data = data.rename(columns={"continent": "place_name"})
data.head()

country,place_name,year,lifeExpectancy,population,gdpPerCapita
Albania,Europe,1977,68.93,2509048,3533.00391
Albania,Europe,1982,70.42,2780097,3630.880722
Albania,Europe,1987,72.0,3075321,3738.932735
Albania,Europe,1992,71.581,3326498,2497.437901
Albania,Europe,1997,72.95,3428038,3193.054604


In [29]:
# Again, the inplace parameter will change the dataframe without assignment
data.rename(columns={"population": "num bers"}, inplace=True)
data.head()

country,place_name,year,lifeExpectancy,num bers,gdpPerCapita
Albania,Europe,1977,68.93,2509048,3533.00391
Albania,Europe,1982,70.42,2780097,3630.880722
Albania,Europe,1987,72.0,3075321,3738.932735
Albania,Europe,1992,71.581,3326498,2497.437901
Albania,Europe,1997,72.95,3428038,3193.054604


In [30]:
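# Rename multiple columns in one go with a larger dictionary.
# Keys that match no column are silently ignored, so this call changes nothing:
# "country" is the index (not a column) and the column is "num bers", not "numbers".
data.rename(
    columns={
        "country": "name",
        "numbers": "year 2001"
    },
    inplace=True
)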
data.head()

country,place_name,year,lifeExpectancy,num bers,gdpPerCapita
Albania,Europe,1977,68.93,2509048,3533.00391
Albania,Europe,1982,70.42,2780097,3630.880722
Albania,Europe,1987,72.0,3075321,3738.932735
Albania,Europe,1992,71.581,3326498,2497.437901
Albania,Europe,1997,72.95,3428038,3193.054604


In [31]:
# Rename all columns using a function, e.g. convert all column names to lower case.
# rename returns a new DataFrame; data itself is unchanged unless the result is assigned.
data.rename(columns=str.lower).head()

country,place_name,year,lifeexpectancy,num bers,gdppercapita
Albania,Europe,1977,68.93,2509048,3533.00391
Albania,Europe,1982,70.42,2780097,3630.880722
Albania,Europe,1987,72.0,3075321,3738.932735
Albania,Europe,1992,71.581,3326498,2497.437901
Albania,Europe,1997,72.95,3428038,3193.054604


In [32]:
# Quickly lowercase all column names and replace spaces with underscores (snake_case).
# Again, the result is a new DataFrame; data itself is unchanged.
data.rename(columns=lambda x: x.lower().replace(' ', '_')).head()

country,place_name,year,lifeexpectancy,num_bers,gdppercapita
Albania,Europe,1977,68.93,2509048,3533.00391
Albania,Europe,1982,70.42,2780097,3630.880722
Albania,Europe,1987,72.0,3075321,3738.932735
Albania,Europe,1992,71.581,3326498,2497.437901
Albania,Europe,1997,72.95,3428038,3193.054604
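
Three points about `rename` are easy to miss: it returns a new DataFrame, it ignores dictionary keys that match no column, and it does not touch the index name. A sketch on a one-row frame shaped like the data above (`sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame(
    {'place_name': ['Europe'], 'lifeExpectancy': [68.93], 'num bers': [2509048]},
    index=pd.Index(['Albania'], name='country'),
)

# rename returns a new DataFrame, so keep the result to make the change stick.
renamed = sample.rename(columns=lambda x: x.lower().replace(' ', '_'))

# Keys that match no column are ignored silently; errors='raise' turns that into a KeyError.
try:
    sample.rename(columns={'numbers': 'year 2001'}, errors='raise')
    missing_key_detected = False
except KeyError:
    missing_key_detected = True

# The index name is not a column; rename_axis changes it.
renamed_index = sample.rename_axis('name')
print(list(renamed.columns), renamed_index.index.name)
```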


## Exporting and saving data

In [33]:
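# Output data to a CSV file
# Typically, I don't want row numbers in my output file, hence index=False.
# Note: data now uses "country" as its index, so index=False also leaves the
# country names out of the file; pass index=True to keep them.
# To avoid character issues, I typically use utf-8 encoding for input/output.
data.to_csv("output_filename.csv", index=False, encoding='utf-8')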
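
The effect of the `index` argument is easiest to see on a small frame whose index carries real information. A sketch that writes to a string instead of a file (`sample` is an illustrative name):

```python
import io
import pandas as pd

sample = pd.DataFrame(
    {'year': [1977, 1982], 'lifeExpectancy': [68.93, 70.42]},
    index=pd.Index(['Albania', 'Albania'], name='country'),
)

# With no path, to_csv returns the CSV text instead of writing a file.
without_index = sample.to_csv(index=False)
with_index = sample.to_csv(index=True)

# Reading the second version back with index_col restores the country index.
restored = pd.read_csv(io.StringIO(with_index), index_col='country')
print(without_index)
print(with_index)
```

When the index is just the default row numbers, `index=False` is usually the right choice; when it holds labels such as country names, it is worth keeping.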