## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics
    
    AUTHOR: Dr. Roy Jafari 

### Chapter 12: Data Fusion & Data Integration 
#### Exercises

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Exercise 1
In your own words, what is the difference between Data Fusion and Data Integration? Give examples other than the ones in this chapter. 


# Exercise 2
Answer the following question about **Challenge 4: Aggregation mismatch**. Is this a data fusion challenge, a data integration challenge, or both? Explain.

# Exercise 3
Why is **Challenge 2: Unwise data collection** both a data cleaning step and a data integration step? Do you think it is essential to categorize whether an unwise data collection falls under data cleaning or data integration?

# Exercise 4
In Example 1 of this chapter, we used multi-level indexing with Date and Hour to overcome the mismatched index formatting challenge. For this exercise, repeat the example, but this time use single-level indexing with a Python DateTime object.
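
As a hint, one possible way to build a single-level datetime index is sketched below. The column names `Date` and `Hour`, the name `toy_df`, and all values are assumptions made for illustration; the files used in Example 1 may be formatted differently.

```python
import pandas as pd

# Hypothetical data: the column names and values are assumptions,
# not the actual contents of the files used in Example 1.
toy_df = pd.DataFrame({
    "Date": ["2016-01-01", "2016-01-01", "2016-01-02"],
    "Hour": [0, 13, 7],
    "Value": [10.5, 11.2, 9.8],
})

# Combine the date and the hour into one timestamp per row.
toy_df["DateTime"] = (
    pd.to_datetime(toy_df["Date"])
    + pd.to_timedelta(toy_df["Hour"], unit="h")
)

# Use the combined column as a single-level DatetimeIndex.
toy_df = toy_df.drop(columns=["Date", "Hour"]).set_index("DateTime")
print(toy_df)
```

Two DataFrames indexed this way can be aligned on the timestamp directly, without a second index level.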

# Exercise 5
Recreate **Figure 5.23** from **Chapter 5 Data Visualization**, but this time, instead of using *WH Report_preprocessed.csv*, first integrate the following three files yourself: *WH Report.csv*, *populations.csv*, and *Countires.csv*. Hint: information about the happiness indices comes from *WH Report.csv*, information about the countries comes from *Countires.csv*, and population information comes from *populations.csv*.
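
The integration step can be approached with successive merges on a shared key. The sketch below uses three small stand-in DataFrames; every column name (`Country`, `happiness`, `population`, `Continent`) and every value is hypothetical and will likely differ from the real files.

```python
import pandas as pd

# Toy stand-ins for the three files. All column names and values are
# hypothetical; inspect the real CSV files before merging them.
report_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "happiness": [7.1, 5.4, 6.0],
})
population_df = pd.DataFrame({
    "Country": ["A", "B", "D"],
    "population": [5_000_000, 12_000_000, 800_000],
})
country_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["X", "Y", "X"],
})

# Left merges keep every row of report_df, so countries missing from
# another source show up as NaN instead of silently disappearing.
merged_df = (
    report_df
    .merge(population_df, on="Country", how="left")
    .merge(country_df, on="Country", how="left")
)

# Rows that could not be matched point to identity or spelling mismatches.
unmatched = merged_df[merged_df["population"].isna()]
print(merged_df)
print(unmatched["Country"].tolist())
```

Checking the unmatched rows before plotting helps catch key mismatches between the sources.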

# Exercise 6
In **Chapter 6, Exercise 2**, we used *ToyotaCorolla_preprocessed.csv* to create a model that predicts the price of cars. In this exercise, we want to do the preprocessing ourselves. Use *ToyotaCorolla.csv* to perform the following steps.

    a.	Are there any concerns regarding Level Ⅰ data cleaning? If yes, address them if necessary. 
    b.	Are there any concerns regarding Level Ⅱ data cleaning? If yes, address them if necessary. 
    c.	Are there any concerns regarding Level Ⅲ data cleaning? If yes, address them if necessary. 
    d.	Are there attributes in ToyotaCorolla.csv that can be considered redundant? 
    e.	Apply LinearRegression from sklearn.linear_model. Did you have to remove the redundant attributes? Why/Why not?
    f.	Apply MLPRegressor from sklearn.neural_network. Did you have to remove the redundant attributes? Why/Why not?
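
For step d, one common way to look for redundant numerical attributes is to inspect pairwise correlations. The following is a minimal sketch on synthetic data; the attribute names, the `0.95` threshold, and the variable names are assumptions and do not come from *ToyotaCorolla.csv*.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical attributes; these are not the columns of ToyotaCorolla.csv.
age_months = rng.uniform(1, 80, n)
toy_df = pd.DataFrame({
    "age_months": age_months,
    "age_years": age_months / 12,          # exact linear function of age_months
    "km": rng.uniform(1_000, 200_000, n),  # generated independently of age
})

# Absolute pairwise Pearson correlations.
corr = toy_df.corr().abs()

# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag pairs above an assumed threshold of 0.95 as candidates for redundancy.
pairs = upper.stack()
redundant_pairs = pairs[pairs > 0.95]
print(redundant_pairs)
```

A high correlation only flags a candidate pair; whether one of the attributes should be dropped still depends on the algorithm used in steps e and f.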


# Exercise 7
We would like to use the file *Universities.csv* to cluster the universities into two meaningful clusters. However, the data source has many issues, including Level Ⅰ to Level Ⅲ data cleaning issues and data redundancy. Perform the following steps.

    a.	Deal with data cleaning issues
    b.	Deal with data redundancy issues
    c.	Use any column necessary except State and Public (1)/ Private (2) to find the two meaningful clusters.
    d.	Perform centroid analysis and give a name to each cluster.
    e.	Find if the newly created categorical attribute cluster has a relationship with either of the two categorical attributes we intentionally did not use for clustering: State or Public (1)/ Private (2).
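
Steps c to e can be sketched as follows, assuming K-Means from scikit-learn is an acceptable clustering algorithm here. The data is synthetic, and the column names (`tuition`, `acceptance_rate`, `kind`) are hypothetical stand-ins for the real attributes of *Universities.csv*.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic data with hypothetical column names; two well-separated groups.
toy_df = pd.DataFrame({
    "tuition": np.concatenate([rng.normal(10_000, 1_000, 20),
                               rng.normal(40_000, 1_000, 20)]),
    "acceptance_rate": np.concatenate([rng.normal(0.8, 0.05, 20),
                                       rng.normal(0.2, 0.05, 20)]),
    "kind": ["public"] * 20 + ["private"] * 20,  # held out of the clustering
})
features = ["tuition", "acceptance_rate"]

# Min-max normalization so that no attribute dominates the distances.
X = (toy_df[features] - toy_df[features].min()) / (
    toy_df[features].max() - toy_df[features].min()
)

# Step c: find two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_df["cluster"] = kmeans.fit_predict(X)

# Step d: centroid analysis on the original scale.
centroids = toy_df.groupby("cluster")[features].mean()

# Step e: contingency table against the held-out categorical attribute.
contingency = pd.crosstab(toy_df["cluster"], toy_df["kind"])
print(centroids)
print(contingency)
```

On real data the clusters will rarely separate this cleanly, so the contingency table is usually followed by a visual or statistical check of the relationship.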


# Exercise 8

In this exercise, we will see an example of data fusion. The case study that we will use in this exercise was already introduced under Data Fusion Example in this chapter; please go back and read it again before continuing with this exercise.
In short, in this example, we would like to integrate Yeild.csv and Treatment.csv to see whether the amount of water can impact the amount of yield.
Perform the following steps to make this happen.


    a.	Use pd.read_csv() to read Yeild.csv into yield_df, and read Treatment.csv into treatment_df.

    b.	Draw a scatterplot of the points in treatment_df. Use the dimension of color to add the amount of water that has been dispensed from each point.

    c.	Draw a scatterplot of the points in yield_df. Use the dimension of color to add the amount of harvest that has been collected from each point.

    d.	Create a scatterplot that combines the visual in b and c.

    e.	From the scatterplots in the preceding steps, we can deduce that the water stations are spaced equidistantly from one another. Based on this realization, calculate the distance between neighboring water points and store it in a variable named radius. We are going to use this variable in the calculations of the next steps.

    f.	First, use the following code to create the function calculateDistance(). 


```python
import math

def calculateDistance(x1, y1, x2, y2):
    # Euclidean distance between the points (x1, y1) and (x2, y2).
    dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
    return dist
```

    Then, use the following code, which relies on the function we just created, to create the function WaterReceived() so that we can apply it to the rows of yield_df.
  
  
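```python
def WaterReceived(r):
    # Sum the water reaching point r from every water station within radius,
    # weighted linearly by how close the station is.
    w = 0
    for i, rr in treatment_df.iterrows():
        distance = calculateDistance(rr.longitude,
                                     rr.latitude,
                                     r.longitude,
                                     r.latitude)
        if distance < radius:
            w = w + rr.water * ((radius - distance) / radius)
    return w
```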

    g.	Apply WaterReceived() to the rows of yield_df, and add the newly calculated value for each row under the column name water.

    h.	Study the newly updated yield_df. You were just able to fuse these two data sources. Go back and study these steps, especially the creation of the function WaterReceived(). What are the assumptions that made this data fusion possible?

Answer: 

    i.	Draw the scatterplot of the two attributes yield_df.harvest and yield_df.water. Do we see an impact from yield_df.water on yield_df.harvest?

    j.	Use the correlation coefficient to confirm your observation from the previous step.
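
Steps g and j can be tried out on a tiny example before running them on the real files. The sketch below redefines the two functions from step f so that it runs on its own; the coordinates, the water and harvest amounts, and the value of `radius` are all invented for illustration and say nothing about the actual data.

```python
import math
import pandas as pd

# Toy stand-ins for the two sources; every value here is invented.
treatment_df = pd.DataFrame({
    "longitude": [0.0, 2.0],
    "latitude": [0.0, 0.0],
    "water": [10.0, 20.0],
})
yield_df = pd.DataFrame({
    "longitude": [0.0, 1.0, 2.0],
    "latitude": [0.0, 0.0, 0.0],
    "harvest": [5.0, 8.0, 11.0],
})
radius = 2.0  # assumed spacing between the two toy water points

def calculateDistance(x1, y1, x2, y2):
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)

def WaterReceived(r):
    w = 0
    for _, rr in treatment_df.iterrows():
        distance = calculateDistance(rr.longitude, rr.latitude,
                                     r.longitude, r.latitude)
        if distance < radius:
            w = w + rr.water * ((radius - distance) / radius)
    return w

# Step g: apply the function row by row (axis=1) and store the result.
yield_df["water"] = yield_df.apply(WaterReceived, axis=1)

# Step j: Pearson correlation between harvest and the fused water value.
correlation = yield_df["harvest"].corr(yield_df["water"])
print(yield_df)
print(correlation)
```

Because `WaterReceived()` loops over every water station for every harvest point, the run time grows with the product of the two table sizes, which is acceptable for small files.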