<a href="https://colab.research.google.com/github/Desmondonam/DS_Python/blob/main/working_with_data_in_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Working with data in Python involves using various libraries and tools to handle, manipulate, analyze, and visualize data. Some of the most popular libraries for data manipulation and analysis in Python are:

## Pandas:
Pandas provides data structures like DataFrames and Series, which are designed to handle and analyze structured data efficiently. It is widely used for data cleaning, filtering, transformation, and aggregation.
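
As a minimal sketch of these two structures, the example below builds a small DataFrame, pulls one column out as a Series, and then filters and aggregates it. The data and the names `df_demo`, `city`, and `price` are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's dataset
df_demo = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "price": [10.0, 14.0, 20.0, 30.0],
})

prices = df_demo["price"]          # selecting one column gives a Series
expensive = df_demo[prices > 12]   # boolean filtering keeps the matching rows
mean_by_city = df_demo.groupby("city")["price"].mean()  # aggregate per group

print(type(prices).__name__)
print(len(expensive))
print(mean_by_city.to_dict())
```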

## NumPy:
NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, as well as a collection of mathematical functions to operate on these arrays.
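
The following is a small sketch of array arithmetic, assuming a made-up 2x3 matrix. It computes per-column means and subtracts them from every row without an explicit loop.

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

col_means = matrix.mean(axis=0)   # mean of each column, shape (3,)
centered = matrix - col_means     # broadcasting subtracts the means from every row

print(matrix.shape)
print(col_means)
print(centered)
```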

## Matplotlib:
Matplotlib is a powerful library for creating static, interactive, and animated visualizations in Python. It offers a wide range of plotting options, such as line plots, scatter plots, bar plots, histograms, and more.

## Seaborn:
Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive statistical visualizations. It is particularly useful for complex visualizations involving statistical relationships.

## SciPy:
SciPy is a library that builds on NumPy and provides additional functionality for scientific and technical computing. It includes modules for optimization, integration, interpolation, and more.

## Scikit-learn:
Scikit-learn is a powerful library for machine learning in Python. It provides a wide range of algorithms for classification, regression, clustering, dimensionality reduction, and more.

## TensorFlow and Keras:
TensorFlow is a deep learning library developed by Google, and Keras is a high-level API that runs on top of TensorFlow. Together, they are widely used for building and training deep neural networks.

## PyTorch:
PyTorch is another popular deep learning library that provides dynamic computation graphs and is widely used for research and production deep learning projects.

## General outline of the steps involved in working with data in Python

1. **Data Collection:** Obtain the data from various sources, such as files (CSV, Excel, JSON, etc.), databases, web APIs, or web scraping.

2. **Data Loading:** Use Pandas or other libraries to load the data into Python data structures like DataFrames for further analysis.

3. **Data Cleaning:** Preprocess the data to handle missing values, duplicates, outliers, and other data quality issues.

4. **Data Exploration:** Use descriptive statistics, visualizations, and plots to gain insights into the data and understand its structure.

5. **Data Manipulation:** Filter, transform, and aggregate the data as needed to prepare it for analysis.

6. **Data Analysis:** Apply statistical techniques, machine learning algorithms, or other data analysis methods to gain insights or make predictions.

7. **Data Visualization:** Create visualizations to communicate the findings and results effectively.

8. **Model Evaluation:** Evaluate the performance of machine learning models using appropriate metrics.

9. **Model Deployment (optional):** If building predictive models, deploy them to make predictions on new data.

10. **Reporting and Communication:** Communicate the findings and results through reports or presentations.

Throughout the process, it's essential to document your code and steps, making it easier to understand and reproduce the analysis. Python's interactive nature and its rich ecosystem of data-related libraries make it a popular choice for data analysis tasks.
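
Steps 2 to 5 of the outline can be sketched end to end on a tiny example. The CSV text and the names `region` and `sales` below are hypothetical stand-ins for a real file or URL, and the cleaning choices are one reasonable option, not the only one.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file or URL
raw_csv = io.StringIO(
    "region,sales\n"
    "north,100\n"
    "north,100\n"
    "south,\n"
    "south,250\n"
)

sales = pd.read_csv(raw_csv)                     # step 2: load into a DataFrame
sales = sales.drop_duplicates()                  # step 3: remove the repeated row
sales = sales.dropna(subset=["sales"])           # step 3: drop rows with no sales value
summary = sales["sales"].describe()              # step 4: explore with summary statistics
totals = sales.groupby("region")["sales"].sum()  # step 5: aggregate per region

print(summary)
print(totals)
```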

## Data Collection


In [2]:
## Reading data
import pandas as pd
url = 'https://raw.githubusercontent.com/Desmondonam/data_bank/main/Data/House_Price.csv'
df = pd.read_csv(url)

In [3]:
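# Importing data from a web-based API using the requests library
import requests

# Other sources noted while exploring:
# https://api.census.gov/data.html
# https://github.com/craigsdennis/intro-to-apis-course/blob/master/course-notes.md

response = requests.get('https://api.ebooks.com/authors/b8a99e2', timeout=10)
try:
    data = response.json()  # parsed JSON is a dict or list, so it has no .head()
    print(data)
except ValueError:
    # Reached when the body is not valid JSON, as in the original run (JSONDecodeError)
    print('Response is not valid JSON; status code:', response.status_code)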
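
The value returned by `response.json()` is a plain dict or list, so DataFrame methods such as `head()` only become available after converting it. The sketch below shows one way to do that with `pd.json_normalize`. The payload is hypothetical (the field names `id`, `name`, and `address` are invented), because the endpoint above did not return JSON.

```python
import json
import pandas as pd

# Hypothetical payload; the field names are made up for illustration
payload = json.loads("""
[
  {"id": 1, "name": "Ada", "address": {"city": "Nairobi"}},
  {"id": 2, "name": "Lin", "address": {"city": "Kisumu"}}
]
""")

authors = pd.json_normalize(payload)  # nested keys become dotted column names
print(authors.head())
```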

In [4]:
df.head()

|  | price | crime_rate | resid_area | air_qual | room_num | age | dist1 | dist2 | dist3 | dist4 | teachers | poor_prop | airport | n_hos_beds | n_hot_rooms | waterbody | rainfall | bus_ter | parks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 32.31 | 0.538 | 6.575 | 65.2 | 4.35 | 3.81 | 4.18 | 4.01 | 24.7 | 4.98 | YES | 5.48 | 11.192 | River | 23 | YES | 0.049347 |
| 1 | 21.6 | 0.02731 | 37.07 | 0.469 | 6.421 | 78.9 | 4.99 | 4.7 | 5.12 | 5.06 | 22.2 | 9.14 | NO | 7.332 | 12.1728 | Lake | 42 | YES | 0.046146 |
| 2 | 34.7 | 0.02729 | 37.07 | 0.469 | 7.185 | 61.1 | 5.03 | 4.86 | 5.01 | 4.97 | 22.2 | 4.03 | NO | 7.394 | 101.12 |  | 38 | YES | 0.045764 |
| 3 | 33.4 | 0.03237 | 32.18 | 0.458 | 6.998 | 45.8 | 6.21 | 5.93 | 6.16 | 5.96 | 21.3 | 2.94 | YES | 9.268 | 11.2672 | Lake | 45 | YES | 0.047151 |
| 4 | 36.2 | 0.06905 | 32.18 | 0.458 | 7.147 | 54.2 | 6.16 | 5.86 | 6.37 | 5.86 | 21.3 | 5.33 | NO | 8.824 | 11.2896 | Lake | 55 | YES | 0.039474 |


## Data Cleaning

In [5]:
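## Data cleaning and handling of missing data
# Both calls return a new DataFrame and leave df unchanged; assign the result to keep it.
df.dropna()   # Drop rows with missing values (result is discarded here)
df.fillna(0)  # Fill missing values with 0 (only this last result is displayed)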

|  | price | crime_rate | resid_area | air_qual | room_num | age | dist1 | dist2 | dist3 | dist4 | teachers | poor_prop | airport | n_hos_beds | n_hot_rooms | waterbody | rainfall | bus_ter | parks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 32.31 | 0.538 | 6.575 | 65.2 | 4.35 | 3.81 | 4.18 | 4.01 | 24.7 | 4.98 | YES | 5.480 | 11.1920 | River | 23 | YES | 0.049347 |
| 1 | 21.6 | 0.02731 | 37.07 | 0.469 | 6.421 | 78.9 | 4.99 | 4.70 | 5.12 | 5.06 | 22.2 | 9.14 | NO | 7.332 | 12.1728 | Lake | 42 | YES | 0.046146 |
| 2 | 34.7 | 0.02729 | 37.07 | 0.469 | 7.185 | 61.1 | 5.03 | 4.86 | 5.01 | 4.97 | 22.2 | 4.03 | NO | 7.394 | 101.1200 |  | 38 | YES | 0.045764 |
| 3 | 33.4 | 0.03237 | 32.18 | 0.458 | 6.998 | 45.8 | 6.21 | 5.93 | 6.16 | 5.96 | 21.3 | 2.94 | YES | 9.268 | 11.2672 | Lake | 45 | YES | 0.047151 |
| 4 | 36.2 | 0.06905 | 32.18 | 0.458 | 7.147 | 54.2 | 6.16 | 5.86 | 6.37 | 5.86 | 21.3 | 5.33 | NO | 8.824 | 11.2896 | Lake | 55 | YES | 0.039474 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 22.4 | 0.06263 | 41.93 | 0.573 | 6.593 | 69.1 | 2.64 | 2.45 | 2.76 | 2.06 | 19.0 | 9.67 | NO | 9.348 | 12.1792 | Lake and River | 27 | YES | 0.056006 |
| 502 | 20.6 | 0.04527 | 41.93 | 0.573 | 6.120 | 76.7 | 2.44 | 2.11 | 2.46 | 2.14 | 19.0 | 9.08 | YES | 6.612 | 13.1648 | Lake and River | 20 | YES | 0.059903 |
| 503 | 23.9 | 0.06076 | 41.93 | 0.573 | 6.976 | 91.0 | 2.34 | 2.06 | 2.29 | 1.98 | 19.0 | 5.64 | NO | 5.478 | 12.1912 |  | 31 | YES | 0.057572 |
| 504 | 22.0 | 0.10959 | 41.93 | 0.573 | 6.794 | 89.3 | 2.54 | 2.31 | 2.40 | 2.31 | 19.0 | 6.48 | YES | 7.940 | 15.1760 |  | 47 | YES | 0.060694 |
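
`dropna()` and `fillna()` return new DataFrames instead of modifying the original, so the result has to be assigned to be kept. The sketch below illustrates this on a hypothetical toy frame. The column names echo the dataset, but the values are invented, and filling with the median or a label is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration
toy = pd.DataFrame({
    "n_hos_beds": [5.0, np.nan, 7.0],
    "waterbody": ["River", None, "Lake"],
})

toy.dropna()             # returns a new frame; toy itself is unchanged
rows_before = len(toy)   # still 3 rows

dropped = toy.dropna()   # assign the result to keep it
filled = toy.fillna({
    "n_hos_beds": toy["n_hos_beds"].median(),  # numeric column: fill with the median
    "waterbody": "Unknown",                    # text column: fill with a label
})

print(toy.isna().sum())  # missing values per column in the original frame
print(filled)
```

Filling each column with a value suited to its type avoids putting a numeric 0 into a text column such as `waterbody`.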


In [6]:
## Removing duplicates
df.drop_duplicates()  # Returns a new DataFrame without repeated rows; df itself is unchanged

|  | price | crime_rate | resid_area | air_qual | room_num | age | dist1 | dist2 | dist3 | dist4 | teachers | poor_prop | airport | n_hos_beds | n_hot_rooms | waterbody | rainfall | bus_ter | parks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 32.31 | 0.538 | 6.575 | 65.2 | 4.35 | 3.81 | 4.18 | 4.01 | 24.7 | 4.98 | YES | 5.480 | 11.1920 | River | 23 | YES | 0.049347 |
| 1 | 21.6 | 0.02731 | 37.07 | 0.469 | 6.421 | 78.9 | 4.99 | 4.70 | 5.12 | 5.06 | 22.2 | 9.14 | NO | 7.332 | 12.1728 | Lake | 42 | YES | 0.046146 |
| 2 | 34.7 | 0.02729 | 37.07 | 0.469 | 7.185 | 61.1 | 5.03 | 4.86 | 5.01 | 4.97 | 22.2 | 4.03 | NO | 7.394 | 101.1200 |  | 38 | YES | 0.045764 |
| 3 | 33.4 | 0.03237 | 32.18 | 0.458 | 6.998 | 45.8 | 6.21 | 5.93 | 6.16 | 5.96 | 21.3 | 2.94 | YES | 9.268 | 11.2672 | Lake | 45 | YES | 0.047151 |
| 4 | 36.2 | 0.06905 | 32.18 | 0.458 | 7.147 | 54.2 | 6.16 | 5.86 | 6.37 | 5.86 | 21.3 | 5.33 | NO | 8.824 | 11.2896 | Lake | 55 | YES | 0.039474 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 22.4 | 0.06263 | 41.93 | 0.573 | 6.593 | 69.1 | 2.64 | 2.45 | 2.76 | 2.06 | 19.0 | 9.67 | NO | 9.348 | 12.1792 | Lake and River | 27 | YES | 0.056006 |
| 502 | 20.6 | 0.04527 | 41.93 | 0.573 | 6.120 | 76.7 | 2.44 | 2.11 | 2.46 | 2.14 | 19.0 | 9.08 | YES | 6.612 | 13.1648 | Lake and River | 20 | YES | 0.059903 |
| 503 | 23.9 | 0.06076 | 41.93 | 0.573 | 6.976 | 91.0 | 2.34 | 2.06 | 2.29 | 1.98 | 19.0 | 5.64 | NO | 5.478 | 12.1912 |  | 31 | YES | 0.057572 |
| 504 | 22.0 | 0.10959 | 41.93 | 0.573 | 6.794 | 89.3 | 2.54 | 2.31 | 2.40 | 2.31 | 19.0 | 6.48 | YES | 7.940 | 15.1760 |  | 47 | YES | 0.060694 |
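
Before dropping duplicates it can help to count them, and sometimes only a subset of columns should define what counts as a repeat. The following sketch uses a hypothetical toy frame with invented values to show both. As with the other cleaning methods, the result must be assigned to be kept.

```python
import pandas as pd

# Hypothetical toy frame containing one fully repeated row
toy = pd.DataFrame({
    "airport": ["YES", "YES", "NO", "NO"],
    "price": [24.0, 24.0, 21.6, 30.0],
})

n_dupes = int(toy.duplicated().sum())                   # count fully repeated rows
deduped = toy.drop_duplicates().reset_index(drop=True)  # assign to keep the result
# Keep only the first row seen for each value of the "airport" column
first_per_airport = toy.drop_duplicates(subset=["airport"], keep="first")

print(n_dupes)
print(deduped)
print(first_per_airport)
```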


## Data Exploration

The `describe()` method returns descriptive statistics for each numeric column: the number of non-missing data points (count), mean, standard deviation, minimum, quartiles, and maximum.
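
By default `describe()` summarizes numeric columns only. As a minimal sketch on a hypothetical toy frame (values invented), passing `include="all"` also reports the number of unique values, the most frequent value, and its frequency for text columns.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "airport": ["YES", "NO", "YES", "YES"],
})

numeric_stats = toy.describe()           # numeric columns only by default
all_stats = toy.describe(include="all")  # adds unique/top/freq for text columns

print(numeric_stats)
print(all_stats)
```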

In [7]:
print(df.describe())

            price  crime_rate  resid_area    air_qual    room_num         age  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    22.528854    3.613524   41.136779    0.554695    6.284634   68.574901   
std      9.182176    8.601545    6.860353    0.115878    0.702617   28.148861   
min      5.000000    0.006320   30.460000    0.385000    3.561000    2.900000   
25%     17.025000    0.082045   35.190000    0.449000    5.885500   45.025000   
50%     21.200000    0.256510   39.690000    0.538000    6.208500   77.500000   
75%     25.000000    3.677083   48.100000    0.624000    6.623500   94.075000   
max     50.000000   88.976200   57.740000    0.871000    8.780000  100.000000   

            dist1       dist2       dist3       dist4    teachers   poor_prop  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.971996    3.628775    3.960672    3.618972   21.544466   12.653063   
std      2.108532    2.1085