# pandas intro
28/06/2022

## Additional Learning Resources
Refer to [scikit-learn documentation](https://scikit-learn.org/stable/) and the [Pandas user guide](https://pandas.pydata.org/docs/) for detailed explanations of the functions used in this notebook.
For a quick refresher on splitting data:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


After this session you should be able to:
- use selected methods of the pandas library
- use pandas documentation for finding and using selected methods 

<a id='toc'/>

### Table of contents
1. [import pandas](#import)
2. [read csv file](#read)
3. [index and columns](#index)
4. [head and tail](#head)
5. [set index](#setindex)
6. [loc and iloc](#loc)
7. [select conditional](#masking)
8. [nan](#nan)
9. [save files](#save)
10. [plot](#plot)

<a id='import' />

#### Import ```pandas``` package as ```pd```

In [1]:
import pandas as pd

[UP](#toc)
<a id='read'/>

#### Load data file with separator ```,```

In [2]:
df = pd.read_csv("population.csv")
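
A minimal sketch of how the separator is passed to `read_csv`. The in-memory text below is a stand-in for the real file and only reuses a few values that appear in the outputs of this notebook; the names `csv_text`, `sample`, and `semi` are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for population.csv (values copied from outputs shown in this notebook)
csv_text = (
    "Total population,1800,1810\n"
    "Afghanistan,3280000,3280000\n"
    "Argentina,534000,534000\n"
)

# sep="," is the default, so it could be omitted here
sample = pd.read_csv(io.StringIO(csv_text), sep=",")

# The same data with semicolons needs an explicit sep=";"
semi = pd.read_csv(io.StringIO(csv_text.replace(",", ";")), sep=";")

print(sample.shape)
print(sample.equals(semi))
```

If the separator does not match the file, pandas usually reads each line into a single column, so checking the shape right after loading is a cheap sanity test.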

[UP](#toc)
<a id='index'/>

#### Get index, columns, and shape of the DataFrame

In [3]:
df.index

RangeIndex(start=0, stop=275, step=1)

In [4]:
df.shape

(275, 82)
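
As a rough illustration on a small hand-built frame (the name `sample` and its three rows are assumptions for this sketch, with values taken from outputs in this notebook), the three attributes relate as follows:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Total population": ["Afghanistan", "Argentina", "Armenia"],
        "1800": [3280000.0, 534000.0, 413326.0],
        "1810": [3280000.0, 534000.0, 413326.0],
    }
)

print(sample.index)    # row labels; a RangeIndex when no index was set
print(sample.columns)  # column labels
print(sample.shape)    # (number of rows, number of columns)

# shape is a plain tuple, so it can be unpacked
n_rows, n_cols = sample.shape
```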

In [None]:
# Practice: implement the steps discussed above


[UP](#toc)
<a id='head'/>

#### Get the first (```head```) or last (```tail```) rows of the DataFrame (5 by default)

In [5]:
df.head(2)

Unnamed: 0,Total population,1800,1810,1820,1830,1840,1850,1860,1870,1880,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,3280000.0,3280000.0,3323519.0,3448982.0,3625022.0,3810047.0,3973968.0,4169690.0,4419695.0,...,25183615.0,25877544.0,26528741.0,27207291.0,27962207.0,28809167.0,29726803.0,30682500.0,31627506.0,32526562.0


In [6]:
df.tail(2)

Unnamed: 0,Total population,1800,1810,1820,1830,1840,1850,1860,1870,1880,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
273,Virgin Islands,,,,,,,,,,...,,,,,,,,,,
274,West Bank,,,,,,,,,,...,,,,,,,,,,
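
A short sketch of how the row count argument behaves, using a made-up ten-row frame (`numbers` is a hypothetical name, not part of the population data):

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

first_five = numbers.head()            # no argument: 5 rows
last_two = numbers.tail(2)             # explicit row count
all_but_last_three = numbers.head(-3)  # negative n: everything except the last 3 rows

print(len(first_five), len(last_two), len(all_but_last_three))
```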


In [None]:
# Practice: implement the steps discussed above


[UP](#toc)
<a id='setindex'/>

#### Change index to column ```Total population``` and drop the column

In [7]:
df.set_index(
    "Total population",
    inplace = True
)

In [8]:
df.index

Index(['Abkhazia', 'Afghanistan', 'Akrotiri and Dhekelia', 'Albania',
       'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda',
       ...
       'British Indian Ocean Territory', 'Clipperton',
       'French Southern and Antarctic Lands', 'Gaza Strip',
       'Heard and McDonald Islands', 'Northern Marianas',
       'South Georgia and the South Sandwich Islands',
       'US Minor Outlying Islands', 'Virgin Islands', 'West Bank'],
      dtype='object', name='Total population', length=275)

In [9]:
df.columns

Index(['1800', '1810', '1820', '1830', '1840', '1850', '1860', '1870', '1880',
       '1890', '1900', '1910', '1920', '1930', '1940', '1950', '1951', '1952',
       '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961',
       '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970',
       '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979',
       '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
       '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'],
      dtype='object')

The long version does the same in two steps (assign the column as the index, then drop the column):

In [10]:
# df.index = df["Total population"]

In [11]:
# df.index

In [12]:
# df.columns

In [13]:
# df.drop("Total population",
#         axis = 1,
#         inplace = True
#        )

The shorter way to redefine our index (from numeric positions to country labels) is a single ```set_index``` call, as used in cell 7:
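
The two routes can be compared side by side on a small frame. This is a sketch under assumptions: `make_sample`, `long_way`, and `short_way` are illustrative names, and the two rows reuse values from outputs in this notebook.

```python
import pandas as pd


def make_sample():
    """Build a fresh two-row frame so each approach starts from the same data."""
    return pd.DataFrame(
        {
            "Total population": ["Afghanistan", "Argentina"],
            "1800": [3280000.0, 534000.0],
        }
    )


# Long version: copy the column into the index, then drop the column
long_way = make_sample()
long_way.index = long_way["Total population"]
long_way = long_way.drop("Total population", axis=1)

# Short version: set_index moves the column into the index in one step
short_way = make_sample().set_index("Total population")

print(long_way.equals(short_way))
```

Neither variant here uses `inplace=True`; reassigning the result keeps the original frame available, which tends to be easier to debug when cells are re-run.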

In [None]:
# Practice: implement the steps discussed above


[UP](#toc)
<a id='loc'/>

#### Use ```loc``` and ```iloc``` to access entries in DataFrame

In [14]:
df.loc["Argentina":"Nigeria"]

Total population,1800,1810,1820,1830,1840,1850,1860,1870,1880,1890,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
Argentina,534000.0,534000.0,570719.0,686703.0,873747.0,1113189.0,1421333.0,1856886.0,2493156.0,3402273.0,...,39558750.0,39969903.0,40381860.0,40798641.0,41222875.0,41655616.0,42095224.0,42538304.0,42980026.0,43416755.0
Armenia,413326.0,413326.0,423527.0,453507.0,496835.0,544302.0,595928.0,652450.0,713957.0,781218.0,...,3002161.0,2988117.0,2975029.0,2966108.0,2963496.0,2967984.0,2978339.0,2992192.0,3006154.0,3017712.0
Aruba,19286.0,19286.0,19555.0,20332.0,21423.0,22574.0,23786.0,25063.0,26404.0,27817.0,...,100830.0,101218.0,101342.0,101416.0,101597.0,101936.0,102393.0,102921.0,103441.0,103889.0
Australia,351014.0,342440.0,334002.0,348143.0,434095.0,742619.0,1256048.0,1724213.0,2253007.0,3088808.0,...,20606228.0,20975949.0,21370348.0,21770690.0,22162863.0,22542371.0,22911375.0,23270465.0,23622353.0,23968973.0
Austria,3205587.0,3286650.0,3391206.0,3538286.0,3728381.0,3962619.0,4235926.0,4556658.0,4947026.0,5408503.0,...,8269372.0,8301290.0,8331465.0,8361362.0,8391986.0,8423559.0,8455477.0,8486962.0,8516916.0,8544586.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
New Zealand,100000.0,100000.0,100000.0,91723.0,82479.0,94934.0,157114.0,301045.0,505065.0,669985.0,...,4187584.0,4238021.0,4285380.0,4329124.0,4369027.0,4404483.0,4435883.0,4465276.0,4495482.0,4528526.0
Ngorno-Karabakh,,,,,,,,,,,...,,,,,,,,,,
Nicaragua,219387.0,219387.0,229075.0,258484.0,303135.0,345374.0,375911.0,405582.0,449633.0,505573.0,...,5450217.0,5522119.0,5594524.0,5666595.0,5737722.0,5807787.0,5877034.0,5945646.0,6013913.0,6082032.0
Niger,1244861.0,1244861.0,1262280.0,1312557.0,1383237.0,1457723.0,1543423.0,1634127.0,1729822.0,1831048.0,...,13995530.0,14527631.0,15085130.0,15672194.0,16291990.0,16946485.0,17635782.0,18358863.0,19113728.0,19899120.0


In [19]:

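# "1888" is not a column label; the slice still works because the column labels are sorted, and it ends at "1880"
df.loc["Argentina", "1800":"1888"] >= 200000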


1800    True
1810    True
1820    True
1830    True
1840    True
1850    True
1860    True
1870    True
1880    True
Name: Argentina, dtype: bool

In [20]:
df["Argentina",:]

TypeError: '('Argentina', slice(None, None, None))' is an invalid key

In [None]:
# How can we extract a row's values without .loc?
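
A few possible answers to that question, sketched on a small frame (`sample` and the `row_by_*` names are assumptions for illustration). Plain square brackets with a label select columns, which is why the attempt above fails; the alternatives below all return the same row.

```python
import pandas as pd

sample = pd.DataFrame(
    {"1800": [3280000.0, 534000.0], "1810": [3280000.0, 534000.0]},
    index=["Afghanistan", "Argentina"],
)

row_by_loc = sample.loc["Argentina"]      # reference result
row_by_xs = sample.xs("Argentina")        # cross-section by row label
row_by_transpose = sample.T["Argentina"]  # after transposing, rows become columns
row_by_position = sample.iloc[sample.index.get_loc("Argentina")]  # label -> position

print(row_by_xs.equals(row_by_loc))
```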

In [None]:
df.loc["Argentina":"Nigeria","1810":"1999"]

In [None]:
df.iloc[0:5]

In [None]:
df.iloc[0:5, 1:10]
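
The main difference between the two accessors is how slices end: label slices include the end label, position slices exclude the end position. A minimal sketch, assuming a three-row frame named `sample` built from values shown in this notebook:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "1800": [3280000.0, 534000.0, 413326.0],
        "1810": [3280000.0, 534000.0, 413326.0],
        "1820": [3323519.0, 570719.0, 423527.0],
    },
    index=["Afghanistan", "Argentina", "Armenia"],
)

# loc: label based, both ends of the slice are included
by_label = sample.loc["Afghanistan":"Argentina", "1800":"1810"]

# iloc: position based, the end of the slice is excluded
by_position = sample.iloc[0:2, 0:2]

# On sorted labels, a slice end that does not exist is still accepted
up_to_1815 = sample.loc["Argentina", "1800":"1815"]

print(by_label.equals(by_position))
print(list(up_to_1815.index))
```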

[UP](#toc)
<a id='masking'/>

#### Use ```masking``` to select data based on condition

In [None]:
df.head(2)

In [None]:
df[
    df["1800"]>=200000
]

In [None]:
df[
    df.iloc[1:3, 1:10]>=200000
]

In [None]:
# (df["1800"]>=200000).to_frame() => Not a priority ;) 
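
A small sketch of what happens in two steps: the comparison produces a boolean Series (the mask), and indexing with it keeps the rows where the mask is True. The frame `sample`, the thresholds, and the result names are assumptions chosen for illustration.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "1800": [3280000.0, 534000.0, 413326.0],
        "1820": [3323519.0, 570719.0, 423527.0],
    },
    index=["Afghanistan", "Argentina", "Armenia"],
)

# Step 1: the comparison yields one True/False value per row
mask = sample["1800"] >= 500000

# Step 2: indexing with the mask keeps only the True rows
large = sample[mask]

# Conditions are combined with & (and) or | (or); each needs its own parentheses
mid_sized = sample[(sample["1800"] >= 500000) & (sample["1820"] < 1000000)]

print(list(large.index))
print(list(mid_sized.index))
```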

[UP](#toc)
<a id='nan'/>

#### Count NaNs per column and drop the rows containing a NaN

In [None]:
df_na = df.isna().sum()

In [None]:
df.dropna(inplace = True)

In [None]:
df.isna().sum()

In [None]:
df.shape
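
The same steps on a small frame, as a sketch (`sample`, `cleaned`, and `only_1810_checked` are illustrative names; the Abkhazia row is all NaN, as in the output near the top of this notebook). Unlike the cell above, this version does not use `inplace=True`, so the original frame stays unchanged.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "1800": [np.nan, 3280000.0, 534000.0],
        "1810": [np.nan, 3280000.0, np.nan],
    },
    index=["Abkhazia", "Afghanistan", "Argentina"],
)

# isna() marks each cell; sum() counts the True values per column
nan_per_column = sample.isna().sum()

# dropna() returns a new frame without rows that contain any NaN
cleaned = sample.dropna()

# subset limits the check to selected columns
only_1800_checked = sample.dropna(subset=["1800"])

print(nan_per_column)
print(cleaned.shape, only_1800_checked.shape)
```

The 1810 value for Argentina is set to NaN here purely to show the effect of `subset`; it is not missing in the real data.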

[UP](#toc)
<a id='save'/>

#### Save DataFrame to ```csv``` or ```xlsx```

In [None]:
df.to_csv("population_modified.csv")

In [None]:
df.to_excel("population_modified.xlsx")
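
A sketch of a save-and-reload round trip, assuming a small frame named `sample` and a temporary directory so that no file is left behind. Because `to_csv` writes the index as the first column, passing `index_col` when reading restores it.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame(
    {"1800": [3280000.0, 534000.0]},
    index=pd.Index(["Afghanistan", "Argentina"], name="Total population"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "population_modified.csv")
    sample.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col="Total population")

print(restored)
```

Writing ```xlsx``` files with `to_excel` typically needs an Excel writer engine such as openpyxl installed, which is why only the CSV route is shown here.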

[UP](#toc)
<a id='plot'/>

#### Plot histogram 

In [None]:
df["1800"].hist()

In [None]:
# Practice: implement the steps discussed above
