## Importing pandas

### Getting started and checking your pandas setup

**1.** Import pandas under the alias `pd`.

In [1]:
import numpy as np
import pandas as pd

**2.** Print the version of pandas that has been imported.

In [2]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.2.4
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 52.0.0.post20210125
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck  
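
`pd.show_versions()` reports the whole environment. If only the pandas version is needed, the `__version__` attribute is a shorter route. A minimal sketch (the variable name `version` is just for illustration):

```python
import pandas as pd

# __version__ is a plain string, for example "1.2.4"
version = pd.__version__
print(version)
```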

**3.** Look up the help text for any function in pandas.

In [3]:
help(pd.cut)

Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
    Bin values into discrete intervals.
    
    Use `cut` when you need to segment and sort data values into bins. This
    function is also useful for going from a continuous variable to a
    categorical variable. For example, `cut` could convert ages to groups of
    age ranges. Supports binning into an equal number of bins, or a
    pre-specified array of bins.
    
    Parameters
    ----------
    x : array-like
        The input array to be binned. Must be 1-dimensional.
    bins : int, sequence of scalars, or IntervalIndex
        The criteria to bin by.
    
        * int : Defines the number of equal-width bins in the range of `x`. The
          range of `x` is extended by .1% on each side to include the minimum
          and maximum values of `x`.
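
The help text describes binning values into intervals. As a small illustration, the sketch below bins a few ages with explicit bin edges. The ages, the edges and the label names are made up for this example and are not part of the exercise.

```python
import pandas as pd

ages = pd.Series([2.5, 3.0, 0.5, 5.0, 7.0])

# Explicit bin edges; with the default right=True each bin is (left, right]
groups = pd.cut(ages, bins=[0, 2, 4, 8], labels=["young", "adult", "senior"])
print(groups.tolist())
```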
    

## DataFrame basics

### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
**4.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [4]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [5]:
df = pd.DataFrame.from_records(data, index=labels)
print(df)

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  2.0    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1


In [6]:
df

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  2.0    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1
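
The plain `pd.DataFrame` constructor also accepts a dictionary of columns together with an `index` argument. The sketch below uses a shortened, made-up version of the data. In recent pandas versions the columns should follow the dictionary's insertion order rather than being sorted alphabetically.

```python
import numpy as np
import pandas as pd

# Shortened illustrative data, not the full exercise dictionary
small_data = {'animal': ['cat', 'snake', 'dog'],
              'age': [2.5, 0.5, np.nan]}
small_labels = ['a', 'b', 'c']

# Build the frame and attach the row labels in one call
df_small = pd.DataFrame(small_data, index=small_labels)
print(df_small)
```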


**5.** Display a summary of the basic information about this DataFrame and its data (*hint: there is a single method that can be called on the DataFrame*).

In [7]:
df.describe(include = "all")

             age animal priority    visits
count        8.0     10       10      10.0
unique       NaN      3        2       NaN
top          NaN    cat       no       NaN
freq         NaN      4        6       NaN
mean      3.4375    NaN      NaN       1.9
std     2.007797    NaN      NaN  0.875595
min          0.5    NaN      NaN       1.0
25%        2.375    NaN      NaN       1.0
50%          3.0    NaN      NaN       2.0
75%        4.625    NaN      NaN      2.75
max          7.0    NaN      NaN       3.0
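
The "single method" in the hint may also refer to `DataFrame.info()`, which lists the index, the column dtypes and the non-null counts rather than statistics. A minimal sketch on a small made-up frame; the `buf` argument is used here only to capture the text that `info()` would otherwise print.

```python
import io

import numpy as np
import pandas as pd

df_small = pd.DataFrame({'animal': ['cat', 'snake', 'dog'],
                         'age': [2.5, 0.5, np.nan]},
                        index=['a', 'b', 'c'])

# info() prints to stdout by default; buf= redirects it into a string buffer
buffer = io.StringIO()
df_small.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```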


**6.** Return the first 3 rows of the DataFrame `df`.

In [8]:
df.head(3)

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2


**7.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

In [9]:
columns = ['animal','age']
df[columns]

  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0


**8.** Select the data in rows `[3, 4, 8]` *and* in columns `['animal', 'age']`.

In [10]:
columns = ['animal','age']
rows = [3,4,8]
df[columns].iloc[rows]

  animal  age
d    dog  NaN
e    dog  5.0
i    dog  7.0
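
Rows by position and columns by name can also be selected in a single indexing call. The sketch below shows two equivalent ways on a small made-up frame: translating the positions to labels for `.loc`, or translating the column names to positions for `.iloc`.

```python
import pandas as pd

df_small = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog'],
                         'age': [2.5, 3.0, 0.5, 4.0, 5.0],
                         'visits': [1, 3, 2, 3, 2]},
                        index=list('abcde'))
rows = [1, 3]              # zero-based row positions
cols = ['animal', 'age']   # column names

# Option 1: turn positions into labels, then use label-based .loc
by_label = df_small.loc[df_small.index[rows], cols]

# Option 2: turn column names into positions, then use position-based .iloc
by_position = df_small.iloc[rows, df_small.columns.get_indexer(cols)]
print(by_label)
```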


**9.** Select only the rows where the number of visits is greater than 3.

In [11]:
df.loc[df['visits']>3]

Empty DataFrame
Columns: [age, animal, priority, visits]
Index: []
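
A comparison such as `df['visits'] > 3` produces a boolean Series, and indexing with it keeps only the rows marked `True`. If no value passes the test, the result is simply empty. A minimal sketch with made-up visit counts:

```python
import pandas as pd

df_small = pd.DataFrame({'visits': [1, 3, 2, 3]}, index=list('abcd'))

# The mask is a boolean Series aligned with the frame's index
mask = df_small['visits'] > 3
more_than_three = df_small[mask]

# Relaxing the condition to >= keeps the rows with exactly 3 visits
at_least_three = df_small[df_small['visits'] >= 3]
print(more_than_three.empty, list(at_least_three.index))
```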


**10.** Check for missing values in the data.

In [12]:
df.isnull().sum()

age         2
animal      0
priority    0
visits      0
dtype: int64

**11.** Select the rows where the animal is a cat *and* the age is less than 3.

In [13]:
df[(df['animal'] == 'cat') & (df['age'] < 3)]

   age animal priority  visits
a  2.5    cat      yes       1
f  2.0    cat       no       3


**12.** Select the rows where the age is between 2 and 4 (inclusive).

In [14]:
df[(2<=df['age']) & (df['age']<=4)]

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
f  2.0    cat       no       3
j  3.0    dog       no       1
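
An inclusive range test can also be written with `Series.between`, which includes both endpoints by default and treats missing values as not matching. A minimal sketch on made-up ages:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'age': [2.5, 0.5, np.nan, 4.0, 5.0]},
                        index=list('abcde'))

# between(2, 4) is True for 2 <= age <= 4; NaN compares as False
in_range = df_small[df_small['age'].between(2, 4)]
print(in_range)
```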


**13.** Change the age in row 'f' to 1.5.

In [15]:
df.loc['f', 'age'] = 1.5
df

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  1.5    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1
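
When setting a single cell, passing the row label and the column name to one `.loc` (or `.at`) call makes pandas write to the frame itself rather than to an intermediate object. A minimal sketch on a small made-up frame:

```python
import pandas as pd

df_small = pd.DataFrame({'animal': ['cat', 'dog'], 'age': [2.0, 5.0]},
                        index=['f', 'g'])

# One indexing call: row label first, column name second
df_small.loc['f', 'age'] = 1.5

# .at is the scalar-only counterpart of .loc
df_small.at['g', 'age'] = 6.0
print(df_small)
```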


**14.** Calculate the sum of all visits in `df` (i.e. find the total number of visits).

In [16]:
df.visits.sum()

19

**15.** Calculate the mean age for each different animal in `df`. Explore the groupby function.

In [17]:
mean_age = df.groupby('animal')['age'].mean()
mean_age

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
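
To explore `groupby` a little further, several aggregations can be computed per group in one call using named aggregation. The sketch below uses made-up data, and the output column names `mean_age` and `total_visits` are chosen only for this example.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog'],
                         'age': [2.0, 3.0, np.nan, 5.0],
                         'visits': [1, 3, 2, 2]})

# Each keyword names an output column: (source column, aggregation)
per_animal = df_small.groupby('animal').agg(
    mean_age=('age', 'mean'),         # NaN ages are skipped by mean
    total_visits=('visits', 'sum'),
)
print(per_animal)
```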

**16.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.

In [18]:
df.loc['k'] = [2.0, 'horse', 'yes', 5]  
display(df)

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  1.5    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1
k  2.0  horse      yes       5
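
The second half of the exercise, deleting the row again, can be done with `DataFrame.drop`. A minimal sketch on a small made-up frame, so the exercise's `df` is left untouched:

```python
import pandas as pd

df_small = pd.DataFrame({'age': [2.5, 3.0], 'animal': ['cat', 'dog']},
                        index=['a', 'b'])

# Assigning to a new label with .loc appends a row
df_small.loc['k'] = [2.0, 'horse']
had_k = 'k' in df_small.index

# drop returns a new frame without the given label
df_small = df_small.drop('k')
print(df_small)
```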


**17.** Count the number of each type of animal in `df`.

In [19]:
df['animal'].value_counts()

dog      4
cat      4
snake    2
horse    1
Name: animal, dtype: int64

**18.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order (so row `i` should be first, and row `d` should be last).

In [21]:
df.sort_values(['age', 'visits'], ascending=[False, True])

   age animal priority  visits
i  7.0    dog       no       2
e  5.0    dog       no       2
g  4.5  snake       no       1
j  3.0    dog       no       1
b  3.0    cat      yes       3
a  2.5    cat      yes       1
k  2.0  horse      yes       5
f  1.5    cat       no       3
c  0.5  snake       no       2
h  NaN    cat      yes       1
d  NaN    dog      yes       3
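
Rows with a missing sort key are placed last by default, whatever the sort direction. The sketch below, on made-up data, repeats the two-key sort and then uses the `na_position` argument to move missing ages to the top instead.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'age': [3.0, np.nan, 7.0, 3.0],
                         'visits': [3, 1, 2, 1]},
                        index=list('abcd'))

# age descending, ties broken by visits ascending; NaN ages end up last
by_age_then_visits = df_small.sort_values(['age', 'visits'],
                                          ascending=[False, True])

# na_position='first' puts the rows with missing ages at the top
nan_first = df_small.sort_values('age', ascending=False, na_position='first')
print(by_age_then_visits)
```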


**19.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

In [23]:
df['priority'] = df['priority'].map({'yes': True, 'no': False})

In [24]:
df.head(5)

   age animal  priority  visits
a  2.5    cat      True       1
b  3.0    cat      True       3
c  0.5  snake     False       2
d  NaN    dog      True       3
e  5.0    dog     False       2
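
When a column holds only two values, a comparison is another way to obtain booleans. A dictionary lookup with `map` gives the same result for 'yes' and 'no', but leaves a missing value for any entry not listed in the dictionary. A minimal sketch on a made-up Series:

```python
import pandas as pd

priority = pd.Series(['yes', 'no', 'yes'], index=list('abc'))

# Comparison: True where the value equals 'yes', False otherwise
by_comparison = priority == 'yes'

# Dictionary lookup: values missing from the dict would become NaN
by_mapping = priority.map({'yes': True, 'no': False})
print(by_comparison.tolist())
```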


**20.** In the 'animal' column, change the 'snake' entries to 'python'.

In [30]:
df['animal'] = df['animal'].replace('snake', 'python')
df.head(5)

   age animal  priority  visits
a  2.5    cat      True       1
b  3.0    cat      True       3
c  0.5 python     False       2
d  NaN    dog      True       3
e  5.0    dog     False       2
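
`replace` also accepts a dictionary, which is convenient when several values need renaming at once. It returns a new Series and leaves the original unchanged unless the result is assigned back. A minimal sketch on a made-up Series (the extra 'dog' to 'puppy' rename is only for illustration):

```python
import pandas as pd

animals = pd.Series(['cat', 'snake', 'dog', 'snake'])

# Keys are the old values, dictionary values are their replacements
renamed = animals.replace({'snake': 'python', 'dog': 'puppy'})
print(renamed.tolist())
```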
