##### Installing Libraries

##### If you don't already have the libraries this notebook needs, you can install them with the commands below
- `!pip install pandas`
- `!pip install vega_datasets`
- `!pip install matplotlib`
- `!pip install folium==0.1.5`
- `!pip install altair`

##### Importing Libraries

In [2]:
import pandas as pd
import os
import altair

##### Pandas can read files from many sources and formats. The examples below cover Excel, CSV and TSV files

- Read csv file
  `pd.read_csv('path/filename.csv')`
- Read excel file
  `pd.read_excel(r'path/filename.xlsx')` or `pd.read_excel(r'path/filename.xls')`
- Read text file with separator (e.g. a tab-separated text file)
   `pd.read_csv(r'path/file_name.txt',sep='\t')`
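
As a minimal sketch of the `sep` argument, the cell below reads the same small table once as comma-separated text and once as tab-separated text. The in-memory strings and the names `csv_text`, `tsv_text`, `csv_df` and `tsv_df` are made up for illustration so that the cell runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files
csv_text = "id,price\n1,100\n2,150\n"
tsv_text = "id\tprice\n1\t100\n2\t150\n"

# The default separator is a comma
csv_df = pd.read_csv(io.StringIO(csv_text))

# For tab-separated text, pass sep='\t'
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls should produce the same dataframe
print(csv_df.equals(tsv_df))
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.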

#####  *We will be using the New York City Airbnb Open Data*. For more information, please visit the [Kaggle url](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data)

In [3]:
airbnb_df = pd.read_csv(r'/run/media/deepak/46d93302-c6c2-4abd-af1c-19c33f3ef9ea/deepu/ABCP/Airbnb data/TrainDeliveries/AB_NYC_2019.csv')

##### To get more information about the dataframe, we can use `df.info()` (column types and non-null counts) or `df.describe()` (summary statistics of the numeric columns)

In [4]:
airbnb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [5]:
airbnb_df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0
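
By default `describe()` summarises only the numeric columns, which is why text columns such as `name` or `room_type` are missing from the output above. The sketch below uses a made-up three-row dataframe (`toy_df` is a hypothetical name, not the Airbnb data) to show that calling `describe()` on a selection of text columns returns count, unique, top and freq instead.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Airbnb dataset
toy_df = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room"],
    "price": [149, 40, 150],
})

# Numeric summary: only the 'price' column is described
numeric_summary = toy_df.describe()
print(numeric_summary)

# Selecting only text columns gives count / unique / top / freq
text_summary = toy_df[["room_type"]].describe()
print(text_summary)
```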


#####  Let's check how many rows and columns the dataframe has by running the cell below. It returns a tuple of (rows, columns)

In [6]:
airbnb_df.shape

(48895, 16)

##### Great! Let's see which columns are available in the dataset

In [7]:
airbnb_df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

###### Let's look at a few records from the beginning. `dataframe.head()` returns the first 5 records by default. To change the number of records returned, pass it as an argument, e.g. `df.head(10)` or `df.head(7)`

In [8]:
airbnb_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


###### Let's look at a few records from the end. `dataframe.tail()` returns the last 5 records by default. To change the number of records returned, pass it as an argument, e.g. `df.tail(10)` or `df.tail(7)`

In [9]:
airbnb_df.tail()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
48894,36487245,Trendy duplex in the very heart of Hell's Kitchen,68119814,Christophe,Manhattan,Hell's Kitchen,40.76404,-73.98933,Private room,90,7,0,,,1,23
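
To make the effect of the argument to `head()` and `tail()` concrete, here is a small sketch on a made-up ten-row dataframe. The name `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ten-row dataframe with ids 0..9
toy_df = pd.DataFrame({"id": range(10), "price": range(100, 110)})

# First three records instead of the default five
first_three = toy_df.head(3)
print(first_three)

# Last two records instead of the default five
last_two = toy_df.tail(2)
print(last_two)
```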


##### Sometimes you need to experiment with the dataset and modify it. In those cases, it's better to keep a copy of the dataset, which you can create with `df.copy()`

In [10]:
airbnb_bk_df = airbnb_df.copy()
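
The sketch below illustrates why the copy is useful: after `copy()`, changes to one dataframe do not show up in the other. The names `original_df` and `backup_df` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical two-row dataframe
original_df = pd.DataFrame({"price": [100, 150]})

# copy() makes an independent (deep) copy by default
backup_df = original_df.copy()

# Modify the original after taking the backup
original_df.loc[0, "price"] = 999

# The backup still holds the value from before the change
print(original_df.loc[0, "price"])
print(backup_df.loc[0, "price"])
```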

##### Columns can be categorical or numerical. To see the unique values of a categorical column, use `df['col_name'].unique()`, and to see the number of unique values in the column, use `df['col_name'].nunique()`

In [11]:
airbnb_df['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

In [12]:
airbnb_df['room_type'].nunique()

3
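
A related call that is not used in this notebook, `value_counts()`, shows how many rows fall into each unique value. The sketch below runs all three calls on a made-up four-row dataframe; `toy_df` and its contents are assumptions, not the Airbnb data.

```python
import pandas as pd

# Hypothetical toy data with a repeated category
toy_df = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room", "Entire home/apt"]
})

# Unique values, in order of first appearance
labels = toy_df["room_type"].unique()
print(labels)

# Number of distinct values
n_labels = toy_df["room_type"].nunique()
print(n_labels)

# Number of rows per distinct value
counts = toy_df["room_type"].value_counts()
print(counts)
```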

##### Filtering data in pandas. There are many ways to filter data; a few of them are shown below

###### The first way to filter data: boolean indexing

In [13]:

airbnb_filtered_df = airbnb_df[airbnb_df['room_type']=='Shared room']
print('No. of records with room type shared is equal to ',airbnb_filtered_df.shape[0])

No. of records with room type shared is equal to  1160


In [14]:
# Let's see some records with filtered data 
airbnb_filtered_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
39,12048,LowerEastSide apt share shortterm 1,7549,Ben,Manhattan,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,2019-07-05,1.81,4,188
203,54453,MIDTOWN WEST - Large alcove studio,255583,Anka,Manhattan,Hell's Kitchen,40.76548,-73.98474,Shared room,105,6,10,2014-01-07,0.09,1,363
357,99070,Comfortable Cozy Space in El Barrio,522065,Liz And Melissa,Manhattan,East Harlem,40.79406,-73.94102,Shared room,65,7,131,2019-05-26,1.31,2,0
492,173072,Cozy Pre-War Harlem Apartment,826192,Lewis,Manhattan,Harlem,40.80827,-73.95329,Shared room,49,3,168,2019-07-06,4.6,1,248
545,200645,Best Manhattan Studio Deal!,933378,Edo,Manhattan,Upper East Side,40.76739,-73.9557,Shared room,90,1,0,,,1,0


###### The second way to filter data: `df.query()`

In [15]:

airbnb_filtered_df = airbnb_df.query('room_type=="Shared room"')
print('No. of records with room type shared is equal to ',airbnb_filtered_df.shape[0])

No. of records with room type shared is equal to  1160


In [16]:
# Let's see some records filtered with the second method. The number of records is the same as with the first method
airbnb_filtered_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
39,12048,LowerEastSide apt share shortterm 1,7549,Ben,Manhattan,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,2019-07-05,1.81,4,188
203,54453,MIDTOWN WEST - Large alcove studio,255583,Anka,Manhattan,Hell's Kitchen,40.76548,-73.98474,Shared room,105,6,10,2014-01-07,0.09,1,363
357,99070,Comfortable Cozy Space in El Barrio,522065,Liz And Melissa,Manhattan,East Harlem,40.79406,-73.94102,Shared room,65,7,131,2019-05-26,1.31,2,0
492,173072,Cozy Pre-War Harlem Apartment,826192,Lewis,Manhattan,Harlem,40.80827,-73.95329,Shared room,49,3,168,2019-07-06,4.6,1,248
545,200645,Best Manhattan Studio Deal!,933378,Edo,Manhattan,Upper East Side,40.76739,-73.9557,Shared room,90,1,0,,,1,0
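
Both filtering styles also work with more than one condition. The following is a sketch on a made-up dataframe (`toy_df`, `mask_result` and `query_result` are hypothetical names): with boolean indexing each condition goes in parentheses and they are joined with `&`, while `query()` accepts `and` inside the expression string.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Airbnb dataset
toy_df = pd.DataFrame({
    "room_type": ["Shared room", "Private room", "Shared room", "Shared room"],
    "price": [40, 149, 105, 55],
})

# Boolean indexing: wrap each condition in parentheses and join with &
mask_result = toy_df[(toy_df["room_type"] == "Shared room") & (toy_df["price"] < 60)]

# query(): the same filter written as an expression string
query_result = toy_df.query('room_type == "Shared room" and price < 60')

# Both approaches select the same rows
print(mask_result.equals(query_result))
print(mask_result)
```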


#####  As you saw in the previous steps, the filtered data keeps the index labels of the original rows, so it is a good idea to reset the index. You can see the difference in the index values: before calling `df.reset_index(drop=True,inplace=True)` on airbnb_filtered_df, the index values of the first 5 records were (39,203,357,492,545), and after calling it they are (0,1,2,3,4)

In [22]:
airbnb_filtered_df.reset_index(drop=True,inplace=True)

In [23]:
airbnb_filtered_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,12048,LowerEastSide apt share shortterm 1,7549,Ben,Manhattan,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,2019-07-05,1.81,4,188
1,54453,MIDTOWN WEST - Large alcove studio,255583,Anka,Manhattan,Hell's Kitchen,40.76548,-73.98474,Shared room,105,6,10,2014-01-07,0.09,1,363
2,99070,Comfortable Cozy Space in El Barrio,522065,Liz And Melissa,Manhattan,East Harlem,40.79406,-73.94102,Shared room,65,7,131,2019-05-26,1.31,2,0
3,173072,Cozy Pre-War Harlem Apartment,826192,Lewis,Manhattan,Harlem,40.80827,-73.95329,Shared room,49,3,168,2019-07-06,4.6,1,248
4,200645,Best Manhattan Studio Deal!,933378,Edo,Manhattan,Upper East Side,40.76739,-73.9557,Shared room,90,1,0,,,1,0
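
The `drop=True` argument decides what happens to the old index. As a sketch on a made-up dataframe (`toy_df`, `reset_df` and `kept_df` are hypothetical names), the cell below compares the two behaviours: with `drop=True` the old labels are discarded, and without it they are kept as a new column named `index`. This version returns a new dataframe instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical filtered-looking data with gaps in the index
toy_df = pd.DataFrame({"price": [40, 105, 65, 49]}, index=[39, 203, 357, 492])

# drop=True discards the old index labels
reset_df = toy_df.reset_index(drop=True)
print(reset_df)

# Without drop=True the old labels become a column called 'index'
kept_df = toy_df.reset_index()
print(kept_df)
```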
