## Data Manipulation With Pandas

### What's the point of pandas?

- Data Manipulation skill track 
- Data Visualization skill track

### Pandas is built on 

- numpy
- matplotlib
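
As a small illustration of the NumPy relationship, the sketch below builds a toy DataFrame (made-up values, not the listings file used in this notebook) and shows that a column's values can be retrieved as a NumPy array.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({"price": [149, 225, 150]})

# A column's values can be exposed as a NumPy array
values = df["price"].to_numpy()
print(type(values))     # <class 'numpy.ndarray'>

# NumPy functions operate on those values directly
print(np.mean(values))
```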

## Introducing DataFrames

In [1]:
import pandas as pd
house = pd.read_csv("house.csv")

### .head()

- Returns the first five rows of the data by default.

In [2]:
# Print the head of the house data
print(house.head())



     id                                              name  host_id  \
0  2539                Clean & quiet apt home by the park     2787   
1  2595                             Skylit Midtown Castle     2845   
2  3647               THE VILLAGE OF HARLEM....NEW YORK !     4632   
3  3831                   Cozy Entire Floor of Brownstone     4869   
4  5022  Entire Apt: Spacious Studio/Loft by central park     7192   

     host_name neighbourhood_group neighbourhood  latitude  longitude  \
0         John            Brooklyn    Kensington  40.64749  -73.97237   
1     Jennifer           Manhattan       Midtown  40.75362  -73.98377   
2    Elisabeth           Manhattan        Harlem  40.80902  -73.94190   
3  LisaRoxanne            Brooklyn  Clinton Hill  40.68514  -73.95976   
4        Laura           Manhattan   East Harlem  40.79851  -73.94399   

         room_type  price  minimum_nights  number_of_reviews last_review  \
0     Private room    149               1                  9  20
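
.head() also accepts the number of rows to return, and .tail() does the same from the end of the data. A minimal sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "id": range(10),
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150],
})

first_three = toy.head(3)  # first three rows instead of the default five
last_two = toy.tail(2)     # last two rows

print(first_three)
print(last_two)
```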

### .info()

- Prints a summary of the data: column names, non-null counts, and data types.

In [3]:
# Print information about house (.info() prints its report itself and returns None)
house.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

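### .shape

- Returns the shape of the data as a tuple: (number of rows, number of columns).
- .shape is an attribute, not a method, so it takes no parentheses.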

In [4]:
# Print the shape of house
print(house.shape)


(48895, 16)
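
Because .shape is a plain tuple, it can be unpacked into separate row and column counts. A minimal sketch on a toy DataFrame (made-up values):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "price": [149, 225, 150],
    "minimum_nights": [1, 1, 3],
})

# Unpack the (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```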


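### .describe()

- Displays the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numeric column.
- The .T in the cell below transposes the result so that each column of the data becomes a row.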

In [5]:
# Print a description of house
print(house.describe().T)

                                  count          mean           std  \
id                              48895.0  1.901714e+07  1.098311e+07   
host_id                         48895.0  6.762001e+07  7.861097e+07   
latitude                        48895.0  4.072895e+01  5.453008e-02   
longitude                       48895.0 -7.395217e+01  4.615674e-02   
price                           48895.0  1.527207e+02  2.401542e+02   
minimum_nights                  48895.0  7.029962e+00  2.051055e+01   
number_of_reviews               48895.0  2.327447e+01  4.455058e+01   
reviews_per_month               38843.0  1.373221e+00  1.680442e+00   
calculated_host_listings_count  48895.0  7.143982e+00  3.295252e+01   
availability_365                48895.0  1.127813e+02  1.316223e+02   

                                       min           25%           50%  \
id                              2539.00000  9.471945e+06  1.967728e+07   
host_id                         2438.00000  7.822033e+06  3.079382e+07
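
To make the rows of the .describe() output easier to read, the sketch below runs it on a toy column (made-up values) and checks that the 50% row matches .median().

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"price": [80, 89, 149, 150, 225]})

summary = toy.describe()
print(summary)

# The "50%" row is the median of the column
print(summary.loc["50%", "price"] == toy["price"].median())
```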

## SUMMARY STATISTICS 

- Summary statistics are numbers that summarize and describe your data, such as its center and spread.

### Summarizing numerical data 

#### MEAN

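- The mean describes the center of your data.
- You can calculate the mean by selecting the column with square brackets and calling .mean().
- The same pattern works for .min(), .max(), .median(), .mode(), .std(), and .var().
- .agg() applies one or more of these summaries in a single call.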
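
The methods listed above that are not demonstrated in the cells below can be sketched on a toy DataFrame (made-up values). The column names mirror the listings data, but the numbers are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "price": [80, 89, 149, 150, 225],
    "minimum_nights": [10, 1, 1, 3, 1],
})

# .agg() computes several summaries of one column in a single call
stats = toy["price"].agg(["min", "max", "mean", "median"])
print(stats)

# .mode() returns the most frequent value(s) as a Series
print(toy["minimum_nights"].mode())

# .agg() also works across several columns at once
spread = toy[["price", "minimum_nights"]].agg(["std", "var"])
print(spread)
```

The variance is the square of the standard deviation, so the two rows of `spread` carry the same information on different scales.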

In [6]:
# Print the mean of the price column
print(house['price'].mean())


152.7206871868289


In [7]:

# Print the median of the price column
print(house['price'].median())

106.0


In [8]:
# Print the maximum of the price column
print(house['price'].max())



10000


In [9]:
# Print the minimum of the price column
print(house['price'].min())

0


### CUMULATIVE SUMMARY

In [10]:
# Sort by price in ascending order (returns a new DataFrame; house itself is unchanged)
price_house = house.sort_values('price', ascending=True)


In [11]:

# Get the cumulative sum 
house['cum_house_price'] = house['price'].cumsum()


In [12]:

# Get the cumulative max 
house['cum_house_price_max'] = house['price'].cummax()


# See the columns you calculated
print(house[["price", "cum_house_price", "cum_house_price_max"]])

       price  cum_house_price  cum_house_price_max
0        149              149                  149
1        225              374                  225
2        150              524                  225
3         89              613                  225
4         80              693                  225
...      ...              ...                  ...
48890     70          7466978                10000
48891     40          7467018                10000
48892    115          7467133                10000
48893     55          7467188                10000
48894     90          7467278                10000

[48895 rows x 3 columns]
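
In the cells above, the sorted result is stored in price_house but the cumulative columns are computed on house in its original row order. As a sketch of the sorted variant, the toy example below (made-up values, hypothetical column names cum_price and cum_price_max) computes the cumulative statistics on the sorted data.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"price": [149, 225, 150, 89, 80]})

# Sort first, then compute the cumulative statistics on the sorted rows
by_price = toy.sort_values("price").reset_index(drop=True)
by_price["cum_price"] = by_price["price"].cumsum()
by_price["cum_price_max"] = by_price["price"].cummax()

print(by_price)
```

On data sorted in ascending order the cumulative maximum is simply the current price, so .cummax() is mostly informative on unsorted data.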


### DUPLICATE

In [13]:
# Keep only the first row for each distinct price
house_price_dup = house.drop_duplicates(subset=['price'])
house_price_dup

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,cum_house_price,cum_house_price_max
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,149,149
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,374,225
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,524,225
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,613,225
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,693,225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48080,36074198,Luxury apartment 2 min to times square,203565865,Vinícius,Manhattan,SoHo,40.72060,-74.00023,Entire home/apt,1308,2,0,,,1,179,7344033,10000
48232,36146672,Veneta New York suite,266037277,Gianluigi,Manhattan,Hell's Kitchen,40.76687,-73.98644,Private room,561,1,0,,,1,364,7366898,10000
48523,36308562,"Tasteful & Trendy Brooklyn Brownstone, near Train",217732163,Sandy,Brooklyn,Bedford-Stuyvesant,40.68767,-73.95805,Entire home/apt,1369,1,0,,,1,349,7413956,10000
48535,36311055,"Stunning & Stylish Brooklyn Luxury, near Train",245712163,Urvashi,Brooklyn,Bedford-Stuyvesant,40.68245,-73.93417,Entire home/apt,1749,1,0,,,1,303,7417845,10000
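
The subset argument can list more than one column, and keep controls which duplicate survives. A minimal sketch on toy data (made-up rows; the column names mirror the listings data):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "neighbourhood": ["Midtown", "Harlem", "Midtown", "Harlem"],
    "room_type": ["Private room", "Private room", "Private room", "Entire home/apt"],
    "price": [150, 150, 150, 89],
})

# Keep the first row for each distinct price
unique_prices = toy.drop_duplicates(subset=["price"])

# Treat rows as duplicates only if both columns match
unique_pairs = toy.drop_duplicates(subset=["neighbourhood", "room_type"])

# Keep the last occurrence of each price instead of the first
last_per_price = toy.drop_duplicates(subset=["price"], keep="last")

print(unique_prices)
print(unique_pairs)
print(last_per_price)
```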


### EXPLICIT INDEXES

In [15]:
# Look at the house data
print(house)

# Select the neighbourhood column as a Series; this keeps the default integer index
# (house.set_index('neighbourhood') would index the rows by neighbourhood instead)
house_ind = house['neighbourhood']

# Look at house_ind
print(house_ind)

# Reset the index, keeping the old index as a column
print(house_ind.reset_index())

# Reset the index, dropping the old index
print(house_ind.reset_index(drop=True))

             id                                               name   host_id  \
0          2539                 Clean & quiet apt home by the park      2787   
1          2595                              Skylit Midtown Castle      2845   
2          3647                THE VILLAGE OF HARLEM....NEW YORK !      4632   
3          3831                    Cozy Entire Floor of Brownstone      4869   
4          5022   Entire Apt: Spacious Studio/Loft by central park      7192   
...         ...                                                ...       ...   
48890  36484665    Charming one bedroom - newly renovated rowhouse   8232441   
48891  36485057      Affordable room in Bushwick/East Williamsburg   6570630   
48892  36485431            Sunny Studio at Historical Neighborhood  23492952   
48893  36485609               43rd St. Time Square-cozy single bed  30985759   
48894  36487245  Trendy duplex in the very heart of Hell's Kitchen  68119814   

           host_name neighbourhood_grou
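
The cell above selects a column, which leaves the default integer index in place. To give the rows an explicit index, .set_index() is the usual tool. A minimal sketch on toy data (made-up values; toy_ind is a hypothetical name):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "neighbourhood": ["Kensington", "Midtown", "Harlem"],
    "price": [149, 225, 150],
})

# Use the neighbourhood column as the row index
toy_ind = toy.set_index("neighbourhood")
print(toy_ind)

# Reset the index, moving neighbourhood back into a column
print(toy_ind.reset_index())

# Reset the index, discarding the neighbourhood labels
print(toy_ind.reset_index(drop=True))
```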

In [16]:
# Make a list of neighbourhoods to subset on
house_neigh = ['Midtown', 'Harlem']

# Subset house using square brackets and .isin()
print(house[house['neighbourhood'].isin(house_neigh)])


             id                                             name    host_id  \
1          2595                            Skylit Midtown Castle       2845   
2          3647              THE VILLAGE OF HARLEM....NEW YORK !       4632   
30         9668                            front room/double bed      32294   
31         9704              Spacious 1 bedroom in luxe building      32045   
33         9783                              back room/bunk beds      32294   
...         ...                                              ...        ...   
48849  36455579                        Studio in Manhattan(独立出入)  257261595   
48871  36475746    A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER  144008701   
48876  36478357  Cozy, Air-Conditioned Private Bedroom in Harlem  177932088   
48886  36483010                  Comfy 1 Bedroom in Midtown East  274311461   
48892  36485431          Sunny Studio at Historical Neighborhood   23492952   

             host_name neighbourhood_group neighbou
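
The same subset can be taken either with a boolean mask from .isin() or, once the column is the index, with .loc and a list of labels. A minimal sketch on toy data (made-up values; by_mask and by_label are hypothetical names):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "neighbourhood": ["Kensington", "Midtown", "Harlem", "Midtown"],
    "price": [149, 225, 150, 99],
})
wanted = ["Midtown", "Harlem"]

# Boolean mask: True where the neighbourhood is in the list
by_mask = toy[toy["neighbourhood"].isin(wanted)]
print(by_mask)

# Label lookup: set the index first, then pass the list to .loc
by_label = toy.set_index("neighbourhood").loc[wanted]
print(by_label)
```

Both approaches select the same rows; the mask keeps the original row order, while the label lookup follows the order of the list.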

## Slicing and subsetting with .loc and .iloc

In [18]:
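# Sort house_ind by its index
house_srt = house_ind.sort_index()

# Try to slice by label from 'Midtown' to 'Clinton Hill'.
# house_srt still has the default integer index, so these string labels match no rows
# (newer pandas versions may raise a TypeError here instead of returning an empty Series)
print(house_srt.loc['Midtown':'Clinton Hill'])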



Series([], Name: neighbourhood, dtype: object)
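
The empty result above comes from slicing by labels that are not in the index. A minimal sketch of label slicing that does return rows, on toy data (made-up values) with the neighbourhood set as a sorted index, alongside positional slicing with .iloc:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "neighbourhood": ["Midtown", "Harlem", "Clinton Hill", "Kensington"],
    "price": [225, 150, 89, 149],
})

# Index by neighbourhood and sort the index alphabetically
toy_srt = toy.set_index("neighbourhood").sort_index()

# .loc slices by label and includes both endpoints
subset = toy_srt.loc["Clinton Hill":"Kensington"]
print(subset)

# With the bounds in reverse alphabetical order the slice is empty
reversed_bounds = toy_srt.loc["Midtown":"Clinton Hill"]
print(reversed_bounds)

# .iloc slices by position and excludes the end position
first_two = toy_srt.iloc[0:2]
print(first_two)
```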


## MISSING VALUES 

In [19]:
# Count the missing values in each column
house.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
cum_house_price                       0
cum_house_price_max                   0
dtype: int64

In [20]:
# Flag each value in the price column as missing (True) or present (False)
house['price'].isna()

0        False
1        False
2        False
3        False
4        False
         ...  
48890    False
48891    False
48892    False
48893    False
48894    False
Name: price, Length: 48895, dtype: bool

In [21]:
# Check whether each column contains any missing values
house.isna().any()

id                                False
name                               True
host_id                           False
host_name                          True
neighbourhood_group               False
neighbourhood                     False
latitude                          False
longitude                         False
room_type                         False
price                             False
minimum_nights                    False
number_of_reviews                 False
last_review                        True
reviews_per_month                  True
calculated_host_listings_count    False
availability_365                  False
cum_house_price                   False
cum_house_price_max               False
dtype: bool
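
.isnull() is an alias of .isna(), so the cells above all use the same check. The sketch below repeats the three patterns on toy data (made-up values) and adds the share of missing values per column, which is an extra step not used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "name": ["Skylit Midtown Castle", None, "Cozy room"],
    "reviews_per_month": [0.38, np.nan, np.nan],
    "price": [225, 150, 89],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Whether each column has at least one missing value
has_missing = toy.isna().any()
print(has_missing)

# Fraction of missing values per column (mean of the True/False flags)
missing_share = toy.isna().mean()
print(missing_share)
```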