# Introduction

The world runs on data. Data can be anything: numbers, documents, images, facts, and so on. It can exist in digital or physical form. *Data* is the plural of **datum**, which means *something given*.

Raw data is only useful once we interpret it to extract the information we need. This information helps an organization design strategies based on facts and trends.

With the advancements in **Python** packages and their ability to perform higher-end analytical tasks, Python has become a go-to language for data analysts.

This notebook covers:

- Important Data Analysis libraries
- Data Pre-processing
- Exploratory Data Analysis

Data scientists and analysts typically spend most of their time on data pre-processing and visualization; compared with that, model building is often the easier part.

## Important Data Analysis libraries

What makes Python useful for data analysis? It offers open-source packages and libraries that are widely used for crunching data. Let us get to know them first.

**Fundamental Scientific Computing-** 

1. `NumPy`- It stands for **Num**erical **Py**thon. The library supports random number generation, linear algebra, and Fourier transforms.

2. `SciPy`- It stands for **Sci**entific **Py**thon. It contains high-level science and engineering modules. You can perform linear algebra, optimization, and fast Fourier transforms. SciPy is built on NumPy.

**Data Manipulation and Visualization-** 

3. `pandas` - In data analysis and machine learning, pandas is used mainly through its data frames. It lets you read data from different formats and sources such as CSV, Excel, plain text, JSON, and SQL databases.

4. `Matplotlib`- It is used for plotting and visualizing data. You can plot histograms, graphs, line plots, heatmaps, and a lot more. It can be embedded in GUI toolkits.

**Machine Learning-** 

5. `scikit-learn`- It is a free machine learning library built on NumPy, SciPy, and Matplotlib. It contains efficient tools for statistical model building and provides various classification, regression, and clustering algorithms. It integrates well with pandas when working with data frames.
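
As a rough illustration of the capabilities listed above, the following minimal sketch touches random numbers, linear algebra, and a Fourier transform in NumPy, then wraps an array in a pandas data frame. The seed, the array values, and the column names are made up for illustration and are not part of the Airbnb dataset.

```python
import numpy as np
import pandas as pd

# Random numbers: a seeded generator makes the run repeatable
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=(4, 2))

# Linear algebra: solve the system A @ x = b
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform: a constant signal only has a zero-frequency component
spectrum = np.fft.fft(np.ones(4))

# pandas: give the columns of the random array names (hypothetical ones)
frame = pd.DataFrame(samples, columns=['a', 'b'])
print(x)
print(frame.shape)
```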

## Importing Libraries and Loading the data

In [0]:
!pip install geopandas

In [0]:
from __future__ import division
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('../input/new-york-city-airbnb-open-data'):  # same folder the CSV is read from below
    for filename in filenames:
        print(os.path.join(dirname, filename))

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
import geopandas as gpd

from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

sns.set_style('darkgrid')

## Exploratory Data Analysis (EDA) and Data Preprocessing

In data analysis, EDA is used to get a better understanding of the data. Looking at a dataset, several questions arise: How many rows and columns are there? Is the data numeric? What are the names of the features (columns)? Are there missing values, or text and symbols that do not belong in the data?

The **shape** attribute and the **info()** method answer most of these questions. The **head()** method displays the first 5 rows of the data frame and **tail()** displays the last 5. The **describe()** method gives a statistical summary of the numeric columns. To split the data into groups by given criteria, we use the **groupby()** method.
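
Before turning to the real dataset, here is a small sketch of what each of these calls returns. The frame `toy` and its values are invented for illustration; only the method names come from pandas.

```python
import pandas as pd

# Made-up listings, only to show what each call returns
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Queens'],
    'price': [100, 250, 150, 80],
})

print(toy.shape)        # (rows, columns) as a tuple
toy.info()              # column names, dtypes and non-null counts
print(toy.head(2))      # first 2 rows (the default is 5)
print(toy.tail(2))      # last 2 rows (the default is 5)
print(toy.describe())   # summary statistics of the numeric columns

# Mean price per neighbourhood group
mean_price = toy.groupby('neighbourhood_group')['price'].mean()
print(mean_price)
```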

First, let's read our data.

In [0]:
data = pd.read_csv(r'../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv')

print('Number of features: %s' %data.shape[1])
print('Number of examples: %s' %data.shape[0])

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead
pd.concat([data.head(), data.tail()])

In [0]:
data.info()

In [0]:
data.describe()

### Evaluation of the data

Let's find out which hosts have the most listings, which hosts are reviewed the most, and which neighbourhood groups have the most listings.

In [0]:
data.isnull().sum()
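
Once the missing values are counted, a typical pre-processing step is to fill or drop them. The sketch below shows both options on a made-up frame; the column names and values are hypothetical, and which option suits the real dataset depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the values are invented for illustration
toy = pd.DataFrame({
    'name': ['Cozy loft', None, 'Sunny room'],
    'reviews_per_month': [0.5, np.nan, 2.0],
})

missing = toy.isnull().sum()   # number of missing values per column
print(missing)

# Option 1: fill the gaps of one numeric column with a constant
filled = toy.fillna({'reviews_per_month': 0})

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()
print(len(dropped))
```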

In [0]:
# Evaluation_1-top_3_hosts

top_3_hosts = (pd.DataFrame(data.host_id.value_counts())).head(3)
top_3_hosts.columns=['Listings']
top_3_hosts['host_id'] = top_3_hosts.index
top_3_hosts.reset_index(drop=True, inplace=True)
top_3_hosts

In [0]:
# Evaluation_2-top_3_neighbourhood_groups

top_3_neigh = pd.DataFrame(data['neighbourhood_group'].value_counts().head(3))
top_3_neigh.columns=['Listings']
top_3_neigh['Neighbourhood Group'] = top_3_neigh.index
top_3_neigh.reset_index(drop=True, inplace=True)
top_3_neigh
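
The two cells above build their tables in several steps. A more compact variant, sketched here on made-up host ids rather than the real column, chains `value_counts()`, `rename_axis()` and `reset_index()`; it is one possible idiom, not the only one.

```python
import pandas as pd

# Made-up host ids standing in for data.host_id
host_id = pd.Series([7, 7, 7, 3, 3, 9], name='host_id')

top_hosts = (
    host_id.value_counts()          # listings per host, largest first
    .head(2)                        # keep the top 2 hosts
    .rename_axis('host_id')         # name the index so it becomes a column
    .reset_index(name='Listings')   # counts go into a 'Listings' column
)
print(top_hosts)
```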

In [0]:
# Evaluation_3-most_reviewed_hosts

# Sum the reviews over all listings of each host, then keep the 3 largest totals
rev_group = pd.DataFrame(data.groupby('host_id')['number_of_reviews'].sum())
most_reviewed = (rev_group.sort_values('number_of_reviews',ascending=False)).head(3)
most_reviewed.columns = ['Number of reviews']
most_reviewed['Host ID'] = most_reviewed.index
most_reviewed.reset_index(drop=True, inplace=True)
most_reviewed
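
To rank hosts by reviews, the reviews of all listings that belong to one host have to be added up first. The following sketch shows that aggregation on a made-up frame (host ids and review counts are invented), using `nlargest()` as a shortcut for sorting and taking the head.

```python
import pandas as pd

# Made-up listings: host 1 owns two listings, hosts 2 and 3 own one each
toy = pd.DataFrame({
    'host_id': [1, 1, 2, 3],
    'number_of_reviews': [10, 5, 12, 1],
})

reviews_per_host = (
    toy.groupby('host_id')['number_of_reviews']
    .sum()          # total reviews over all listings of a host
    .nlargest(2)    # keep the two most reviewed hosts
)
print(reviews_per_host)
```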

The word cloud below is built from the neighbourhood names of the listings. The larger a word appears, the more often it occurs. This cell requires the `wordcloud` library (for example, `!pip install wordcloud`).

In [0]:
from wordcloud import WordCloud, ImageColorGenerator
wordcloud = WordCloud(background_color='beige').generate(" ".join(data.neighbourhood))
plt.figure(figsize=(25,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('neighbourhood.png')
plt.show()
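
The word sizes in the cloud reflect how often each word occurs in the joined text. A rough way to inspect those frequencies directly is sketched below with the standard library; the neighbourhood names are made up, and the `wordcloud` library applies its own tokenization, so its counts may differ in detail.

```python
from collections import Counter

# Made-up neighbourhood names standing in for data.neighbourhood
neighbourhoods = ['Harlem', 'Midtown', 'Harlem', 'Chelsea', 'Harlem', 'Midtown']

# Join into one text, as done for the word cloud, then count the words
text = ' '.join(neighbourhoods)
frequencies = Counter(text.split())

print(frequencies.most_common(2))
```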

# Visualisations

It is often said that we understand data faster when we visualize it. The following cells work through these types of plots:

- [Pie](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.pie.html)
- [Histogram](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.pyplot.hist.html)
- [Barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html)
- [Boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html)
- [Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html)

In [0]:
fig = plt.figure(figsize = (15,10))
ax = fig.gca()
data.hist(ax=ax)
plt.show()
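
A histogram groups a numeric column into bins and counts the values that fall into each one. The sketch below computes such counts with `np.histogram` on made-up prices; the bin edges are chosen by hand for illustration and are not the ones `data.hist()` would pick.

```python
import numpy as np

# Made-up prices; the bin edges are chosen by hand for illustration
prices = np.array([40, 60, 75, 90, 120, 150, 400])
counts, edges = np.histogram(prices, bins=[0, 100, 200, 500])

# Print each bin with the number of values it contains
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left}-{right}: {count}')
```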

In [0]:
labels = data.neighbourhood_group.value_counts().index
colors = ['lightblue','beige','lightgreen','orange','cyan']
explode = [0.1,0.0,0.3,0.5,0.0]  # offset of each wedge from the centre
sizes = data.neighbourhood_group.value_counts().values

plt.figure(0,figsize = (7,7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',shadow=True)
plt.title('Neighbourhood Group',color = 'black',fontsize = 15)
plt.show()
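
The percentages printed on the pie wedges are the share of listings per neighbourhood group. They can also be computed as numbers, as in this sketch on a made-up series (the group labels are only examples):

```python
import pandas as pd

# Made-up group labels standing in for data.neighbourhood_group
groups = pd.Series(['Manhattan', 'Manhattan', 'Brooklyn', 'Queens'])

# normalize=True returns fractions instead of counts; scale to percent
shares = groups.value_counts(normalize=True) * 100
print(shares.round(1))
```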

In [0]:
# Visualisation1- Neighbourhood_groups-room_type

plt.figure(figsize=(15,6))
sns.countplot(data=data, x='neighbourhood_group', hue='room_type', palette=sns.color_palette("Set3", n_colors=3))
plt.title('Counts of neighbourhoods vs room type', fontsize=15)
plt.xlabel('Neighbourhood group')
plt.ylabel("Count")
plt.legend(frameon=False, fontsize=12)
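
The bar heights of a count plot with a `hue` are simply the counts of each combination of the two columns. A sketch with `pd.crosstab` on made-up rows shows the same numbers as a table:

```python
import pandas as pd

# Made-up listings; only the two categorical columns matter here
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan', 'Manhattan'],
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
})

# Rows: neighbourhood group, columns: room type, cells: number of listings
counts = pd.crosstab(toy['neighbourhood_group'], toy['room_type'])
print(counts)
```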

In [0]:
# neighbourhood_group-price
# Order the bars by median price; barplot itself draws the mean of each group by default
result = data.groupby('neighbourhood_group')['price'].median().reset_index().sort_values('price')
sns.barplot(x='neighbourhood_group', y='price', data=data, palette=colors, order=result['neighbourhood_group'])
plt.xticks(rotation=45)
plt.show()
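
Note that seaborn's `barplot` draws the mean of each group by default, while the cell above orders the bars by the median. With skewed prices the two can differ a lot, as this sketch on made-up prices suggests:

```python
import pandas as pd

# Made-up prices: one very expensive listing skews the Bronx mean
toy = pd.DataFrame({
    'neighbourhood_group': ['Bronx', 'Bronx', 'Bronx', 'Queens', 'Queens', 'Queens'],
    'price': [50, 60, 1000, 90, 100, 110],
})

# Mean and median price side by side for each group
summary = toy.groupby('neighbourhood_group')['price'].agg(['mean', 'median'])
print(summary)
```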

In [0]:
# neighbourhood_group-availability_365
# Order the boxes by median availability
result = data.groupby('neighbourhood_group')['availability_365'].median().reset_index().sort_values('availability_365')
sns.boxplot(x='neighbourhood_group', y='availability_365', data=data, order=result['neighbourhood_group'])
plt.show()
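
A box plot summarizes each group by its quartiles; by default, points further than 1.5 times the interquartile range from the box are drawn as outliers. The sketch below computes these quantities by hand on made-up availability values:

```python
import pandas as pd

# Made-up availability values (days per year)
availability = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 365])

# Quartiles: the box spans q1 to q3, the line inside is the median
q1, median, q3 = availability.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default outlier limits of a box plot: 1.5 * IQR beyond the box
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = availability[(availability < lower) | (availability > upper)]
print(q1, median, q3, outliers.tolist())
```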

In [0]:
plt.figure(figsize=(10,6))
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(data=data, x='longitude', y='latitude', hue='neighbourhood_group')
plt.show()

In [0]:
plt.figure(figsize=(10,6))
# Colour each listing by its availability over the year
sns.scatterplot(data=data, x='longitude', y='latitude', hue='availability_365')
plt.show()