# Project - Play with DataFrames

## Goal of Project
- Master pandas DataFrame

### Step 1: Import pandas
- Execute the cell below (SHIFT + ENTER)

In [1]:
import pandas as pd

### Step 2: Read the data
- Use ```pd.read_csv()``` to read the file `files/population.csv` (the cell below loads the same file from its GitHub URL)
- NOTE: Remember to assign the result to a variable (e.g., ```data```)

In [24]:
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/jupyter/final/files/population.csv')
data.head()

Unnamed: 0,Country,Year,Population
0,Denmark,2000,5.3
1,Denmark,2010,5.5
2,Denmark,2020,5.8
3,Sweden,2000,8.8
4,Sweden,2010,9.3
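
To try the same call without network access, here is a minimal sketch that reads from an in-memory string instead of the URL. The CSV layout in `csv_text` is an assumption reconstructed from the outputs shown in this notebook; the real file may be laid out differently.

```python
import io

import pandas as pd

# Assumption: these rows mirror the values shown in this notebook's outputs;
# the layout of the real population.csv may differ.
csv_text = """Country,Year,Population
Denmark,2000,5.3
Denmark,2010,5.5
Denmark,2020,5.8
Sweden,2000,8.8
Sweden,2010,9.3
Sweden,2020,10.2
"""

# pd.read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.head())
```

Note that `head()` shows only the first five rows by default, which is why the sixth row (Sweden, 2020) does not appear in the output above.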


### Step 3: Investigate the data types
- Use ```.dtypes``` to list the data type of each column

In [3]:
data.dtypes

Country        object
Year            int64
Population    float64
dtype: object
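
Once the dtypes are known, they can also be used to pick out columns. The following is a small sketch on a hypothetical one-row frame (`df` is not the notebook's `data`) that keeps only the numeric columns with `select_dtypes`.

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the notebook's data.
df = pd.DataFrame({'Country': ['Denmark'], 'Year': [2000], 'Population': [5.3]})

# Keep only the integer and float columns.
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))
```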

### Step 4: Convert Year to Datetime
- ```pd.to_datetime(...)```: Converts a column to datetime values
- ```format='%Y'```: Describes the format of the input; here ```%Y``` means a four-digit year

In [27]:
data['Year'] = pd.to_datetime(data['Year'], format='%Y')

In [28]:
data['Year']

0   2000-01-01
1   2010-01-01
2   2020-01-01
3   2000-01-01
4   2010-01-01
5   2020-01-01
Name: Year, dtype: datetime64[ns]

In [17]:
data.dtypes

Country               object
Year          datetime64[ns]
Population           float64
dtype: object
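
As the output shows, a year alone becomes January 1 of that year. A minimal sketch of the round trip, using a standalone Series named `years` (an illustrative name, not part of the notebook), shows how to get the year back with the `.dt` accessor:

```python
import pandas as pd

# Standalone example values; the names are illustrative only.
years = pd.Series([2000, 2010, 2020], name='Year')

# Parse each integer as a four-digit year; month and day default to January 1.
as_dates = pd.to_datetime(years, format='%Y')

# The .dt accessor exposes datetime components such as the year.
recovered = as_dates.dt.year.tolist()
print(as_dates)
print(recovered)
```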

### Step 5: Scale Population from millions to thousands
- HINT: ```data['Population']*1000``` multiplies every value by 1000

In [29]:
data['Population'] = data['Population']*1000

In [30]:
data['Population']

0     5300.0
1     5500.0
2     5800.0
3     8800.0
4     9300.0
5    10200.0
Name: Population, dtype: float64

In [31]:
data.head()

Unnamed: 0,Country,Year,Population
0,Denmark,2000-01-01,5300.0
1,Denmark,2010-01-01,5500.0
2,Denmark,2020-01-01,5800.0
3,Sweden,2000-01-01,8800.0
4,Sweden,2010-01-01,9300.0


In [32]:
data

Unnamed: 0,Country,Year,Population
0,Denmark,2000-01-01,5300.0
1,Denmark,2010-01-01,5500.0
2,Denmark,2020-01-01,5800.0
3,Sweden,2000-01-01,8800.0
4,Sweden,2010-01-01,9300.0
5,Sweden,2020-01-01,10200.0
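
One thing to watch for: the assignment in Step 5 overwrites the column, so running that cell twice scales the values twice. A possible alternative, sketched here on a hypothetical frame, writes the result to a new column (`Population_thousands` is an assumed name) and leaves the original values untouched.

```python
import pandas as pd

# Hypothetical two-row frame; values are taken from the outputs above.
df = pd.DataFrame({'Country': ['Denmark', 'Sweden'], 'Population': [5.8, 10.2]})

# Writing to a new column makes the cell safe to re-run.
df['Population_thousands'] = df['Population'] * 1000
print(df)
```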


### Step 6: Calculate mean population for each country
- HINT: ```data.groupby('Country')``` groups the data

In [33]:
data.groupby('Country')['Population'].mean()

Country
Denmark    5533.333333
Sweden     9433.333333
Name: Population, dtype: float64
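
The same grouping can return several statistics at once. This is a sketch using `.agg` on a frame rebuilt from the values shown above; `df` and `summary` are illustrative names.

```python
import pandas as pd

# Frame rebuilt from the scaled values shown in the outputs above.
df = pd.DataFrame({
    'Country': ['Denmark'] * 3 + ['Sweden'] * 3,
    'Population': [5300.0, 5500.0, 5800.0, 8800.0, 9300.0, 10200.0],
})

# One row per country, one column per statistic.
summary = df.groupby('Country')['Population'].agg(['mean', 'min', 'max'])
print(summary)
```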

### Step 7: Replace Denmark with DNK
- Given a column, you can access string functions on it with ```.str```
    - These functions are applied to every value in the column
    - HINT: ```data['Country'].str.replace('Denmark', 'DNK')```

In [34]:
data['Country'] = data['Country'].str.replace('Denmark','DNK')

In [35]:
data.head()

Unnamed: 0,Country,Year,Population
0,DNK,2000-01-01,5300.0
1,DNK,2010-01-01,5500.0
2,DNK,2020-01-01,5800.0
3,Sweden,2000-01-01,8800.0
4,Sweden,2010-01-01,9300.0


In [36]:
data['Country'] = data['Country'].str.replace('Sweden','SWD')

In [37]:
data

Unnamed: 0,Country,Year,Population
0,DNK,2000-01-01,5300.0
1,DNK,2010-01-01,5500.0
2,DNK,2020-01-01,5800.0
3,SWD,2000-01-01,8800.0
4,SWD,2010-01-01,9300.0
5,SWD,2020-01-01,10200.0
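
The two `.str.replace` calls above can also be written as one step. A sketch, assuming whole-value matches are what is wanted: `Series.replace` with a dictionary swaps entire values, whereas `.str.replace` substitutes substrings inside each value. The names `countries` and `codes` are illustrative.

```python
import pandas as pd

# Standalone example column; the codes match the ones used in the notebook.
countries = pd.Series(['Denmark', 'Denmark', 'Sweden'], name='Country')
codes = {'Denmark': 'DNK', 'Sweden': 'SWD'}

# Each value that exactly equals a dictionary key is replaced by its code.
renamed = countries.replace(codes)
print(renamed.tolist())
```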


In [42]:
data.iloc[1:3]

Unnamed: 0,Country,Year,Population
1,DNK,2010-01-01,5500.0
2,DNK,2020-01-01,5800.0
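
The slice `data.iloc[1:3]` selects by position and excludes the end, which is why only rows 1 and 2 come back. For comparison, here is a short sketch on a hypothetical frame showing that label-based slicing with `.loc` includes the end label.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3.
df = pd.DataFrame({
    'Country': ['DNK', 'DNK', 'DNK', 'SWD'],
    'Population': [5300.0, 5500.0, 5800.0, 8800.0],
})

by_position = df.iloc[1:3]  # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]      # labels 1, 2 and 3 (end included)
print(len(by_position), len(by_label))
```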
