![Solvay Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/Solvay_Brussels_School_logo.svg/1280px-Solvay_Brussels_School_logo.svg.png)
# TP 5 Data Quality
## Background

You did it! You managed to retrieve information either from the web or from another source. Things are getting pretty serious.

Usually, you will see that you have to preprocess the data before using it in your models. This is the goal of the session: making sure we have clean and well-formatted data to work with.

In this session, we will talk about the following issues or challenges arising from real-world data:
1. Missing Values
2. Outliers
3. Duplicates
4. Frequent transformation and formatting

# Missing Values

One common issue is missing values.

There are many possible causes:
* In surveys, some people leave answers blank
* Some information is unknown for certain observations (e.g. the shares of different religious groups are public in some countries but not in others)
* When reconciling data from different sources, some information may be present in one source but not in another
* There may have been mishandling of data
* etc.

There are several ways to get rid of those. But one has to be careful about their impact on the final result and always perform checks to assess the consequences of the chosen method.
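
Before choosing a method, a quick first check is to count how many values are missing in each column. The snippet below is a minimal sketch on a hypothetical dataframe (the `survey` name and its columns are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only
survey = pd.DataFrame({
    "age": [34, np.nan, 51, 28],
    "income": [2500.0, 3100.0, np.nan, np.nan],
})

# isna() marks each missing cell with True; summing counts them per column
n_missing = survey.isna().sum()
# Taking the mean of the same mask gives the share of missing values per column
share_missing = survey.isna().mean()

print(n_missing)
print(share_missing)
```

A column with a large share of missing values usually deserves more care than one with a handful of holes.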

In the following exercises, we will build a hypothetical example and look at the consequences of several ways of handling missing values.

In this scenario, we have two independent variables (or features) and one dependent variable (or target). We want to identify the coefficients that let us recover the dependent variable from the independent ones.

Imagine the following scenario: children and dogs both eat biscuits, while parents don't. You want to estimate how much money each family is going to spend on biscuits in a given year. You strongly suspect a direct linear relationship between the total spent on biscuits and both the number of children and the number of dogs in the family.

In [None]:
# The following code will generate the example. 
#If you execute this notebook, please make sure you run this cell before trying to run the rest of the code
import numpy as np
import pandas as pd
from sklearn import linear_model as lm
import scipy as sp
np.random.seed(11)
# We will build a simple linear regression model of the kind: Y = a * X_1 + b * X_2 where X_2 contains missing values
a = 3
b = 2
lambda_1 = 2
lambda_2 = 1
X_1 = np.random.poisson(lambda_1, 100)
X_2 = np.random.poisson(lambda_2, 100).astype(float)  # float (not int) so that NaN can be stored later
Y = a*X_1 + b* X_2 + np.random.normal(0, 1.5, 100)
missing = np.random.binomial(1, 0.3, 100).astype('bool')
data = pd.DataFrame({"SpentOnBiscuits" : Y, "NChildren" : X_1, "NDogs" : X_2})

In [None]:
data.head()

In [None]:
Y = data.SpentOnBiscuits
X = data.loc[:,['NChildren', 'NDogs']]
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n"+
     "b_hat = " +str(res.coef_[1]))

But there is a catch: several people don't want to report the number of dogs they have, to avoid attracting thieves who would steal their canine companions (no similar problem is observed for children, who are reported faithfully).

Around 30% of the population is suspicious and does not want to disclose the number of dogs they have. We have to form a prediction nonetheless.

In [None]:
missing  # boolean mask: True marks the observations whose number of dogs we will erase

In [None]:
X_2[missing] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in every version
data.NDogs = X_2

In [None]:
data.head()

## Option 1: Deleting any line with missing value 

This is the radical option! You can delete every line that contains a missing value. It usually works well when you have few missing values. The good thing is that you don't have to guess what the missing values would have been. The bad thing is that you "lose" information that may have been useful for estimating the other coefficients as well (be wary though, keeping them may introduce bias in your results).

Let's do this! It is easily done with the *dropna()* method:

data = data.dropna()

In this statement, you call the *dropna()* method from the pandas library on the dataframe and assign the newly created dataframe to the variable called *data*. Note that just invoking *data.dropna()* won't change the dataframe: this method returns a copy of the dataframe with fewer rows, it does not modify the existing dataframe (remember your programming class from BA3?).
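
The difference between calling *dropna()* and assigning its result can be sketched on a tiny hypothetical dataframe (`toy` is an illustrative name, not part of the exercise):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"NChildren": [2, 0, 1], "NDogs": [1.0, np.nan, 3.0]})

toy.dropna()                     # the result is discarded, toy itself is untouched
rows_after_bare_call = len(toy)

cleaned = toy.dropna()           # the result is kept under a new name
rows_in_cleaned = len(cleaned)

print(rows_after_bare_call, rows_in_cleaned)
```

The first number is still the full row count, which shows that the bare call had no lasting effect.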

Try it by yourself:

In [None]:
print("Before dropping missing values, we have " + str(len(data)) + " rows")
data1 = data.dropna()
print("After dropping missing values, we have " + str(len(data1)) + " rows")

We can then use the dataset without missing values to perform the analysis (don't worry if you don't remember how a regression works, we'll see that in a later session).

In [None]:
Y = data1.SpentOnBiscuits
X = data1.loc[:,['NChildren', 'NDogs']]
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n"+
     "b_hat = " +str(res.coef_[1]))

Note that those results are pretty similar to the ones we found on the original dataset. In this case, dropping the missing values seems to be a good approach.

## Option 2: Try to approximate the missing values

A somewhat less radical but equally delicate approach is to fill in missing values with other values. Typical choices are:
* Mean
* Median
* Mode
* Min or max
* Interpolation

The choice of statistic or value depends on the context and always warrants testing (it is usually easy to test several solutions for the same problem).

The most frequent choices are the mean (for continuous variables) or mode (for discrete variables). Fortunately, substituting the missing values by whatever you want is relatively easy.

With pandas, you can use the method *fillna()* to do just that.

Here is [the link to the documentation for the fillna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)
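
When several columns have holes, passing a single value to *fillna()* puts that same value in every column. A minimal sketch of a per-column fill, using a dictionary that maps column names to fill values, on a hypothetical dataframe (`toy` and the choice of statistics are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NChildren": [2.0, np.nan, 1.0, 3.0],
    "NDogs": [1.0, 0.0, np.nan, 1.0],
})

# One fill value per column; mean() and median() skip NaN by default
fill_values = {
    "NChildren": toy["NChildren"].median(),
    "NDogs": toy["NDogs"].mean(),
}

# fillna returns a new dataframe, toy is left unchanged
filled = toy.fillna(fill_values)
print(filled)
```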

Let's do the first one together.

We will start by filling in the missing values with the mean of the column *NDogs*.

In [None]:
~data.NDogs.isna()

In [None]:
# Compute the mean of the field and assign it to every missing value
mean_dogs = np.mean(data.NDogs[~data.NDogs.isna()])
data2 = data.fillna(mean_dogs)
data2.head()

In [None]:
# Run the regression
Y = data2.SpentOnBiscuits
X = data2.loc[:,['NChildren', 'NDogs']]
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n"+
     "b_hat = " +str(res.coef_[1]))

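Your turn: instead of assigning the mean to the missing values, try assigning the mode. Use the function *sp.stats.mode(x)\[0\]\[0\]* to compute it (in recent SciPy versions, where the mode of a one-dimensional input is returned as a scalar, a single *\[0\]* is enough). Just replace *x* with the field whose mode you'd like to retrieve.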

Once this is done, run the next cell and find the estimated coefficients. Is using the mode more accurate than using the mean?

Here is the [link to the documentation of scipy.stats.mode()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html) for reference.

In [None]:
# Compute the mode of the field and assign it to every missing value
mode_dogs = 
data3 = 

In [None]:
data3.head()

In [None]:
# Run the regression
Y = data3.SpentOnBiscuits
X = data3.loc[:,['NChildren', 'NDogs']]
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n"+
     "b_hat = " +str(res.coef_[1]))

### Interpolation

Sometimes, the order of the observations has a meaning. Think about sequential observations. If one of the independent variables you are observing is sampled from a continuous process and the sampling frequency is high enough, a missing value is likely to lie between the observations that come just before and just after it.

Imagine you're observing the outside temperature every hour and using it to estimate the number of soda cans a vending machine sells. The issue is that the temperature sensor occasionally fails to transmit its reading.

You may assume that, for lack of a better model, each time you have a "hole" in the data, the temperature is between the previous one and the next one. Usually, a linear approximation is "good enough" for most purposes.
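
To make the idea concrete, here is a minimal sketch of linear interpolation on a short hypothetical series of hourly temperatures (the values are invented, and the readings are assumed to be equally spaced in time):

```python
import numpy as np
import pandas as pd

# Hourly readings with one single hole and one hole of two consecutive hours
temperature = pd.Series([10.0, np.nan, 14.0, 15.0, np.nan, np.nan, 9.0])

# Each hole is filled with points on the straight line joining its neighbours
filled = temperature.interpolate(method="linear")
print(filled.tolist())
```

The single hole becomes the midpoint of 10 and 14, and the double hole is split into equal steps between 15 and 9.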

In [None]:
# Let's build a toy example once again. The temperature is represented as the product of two sinusoidal functions
np.random.seed(11)
stops = np.arange(0, 80, 0.8)
a = 8
b = 5
X_1 = np.random.normal(14, 2, 100)
X_2 = 10 * np.sin(stops) * np.sin(stops + np.random.randn(100)/4)
Y = a * X_1 + b * X_2 + np.random.normal(0,0.5,100)
missing = np.random.binomial(1, 0.1, 100).astype('bool')
missing[0] = False # Otherwise, we can't interpolate the first value
missing[-1] = False # same
soda = pd.DataFrame({"sales" : Y, "demand" : X_1, "temperature" : X_2})
regr = lm.LinearRegression()
res = regr.fit(soda.loc[:,["demand", "temperature"]], soda.sales)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n" +
     "b_hat = " + str(res.coef_[1]))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(X_2, marker='o')

In [None]:
X_2[missing] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in every version
soda.temperature = X_2

In [None]:
soda.head()

We'll only look at linear interpolation, but there are other types. Doing it with pandas is, once again, fairly easy: the *interpolate()* method does it in one instruction. Let's do this one together.

But before that, for practice, remove the rows with missing values and see what the regression gives without them.

In [None]:
soda2 = soda.dropna()
regr = lm.LinearRegression()
res = regr.fit(soda2.loc[:,["demand", "temperature"]], soda2.sales)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n" +
     "b_hat = " + str(res.coef_[1]))

Let's do this! [The documentation for the interpolation is under this link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html).

In [None]:
temp = # extract the temperature column
soda3 = # copy the dataframe
soda3.temperature = # use interpolation

In [None]:
regr = lm.LinearRegression()
res = regr.fit(soda3.loc[:,["demand", "temperature"]], soda3.sales)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n" +
     "b_hat = " + str(res.coef_[1]))

In [None]:
soda3.head()

# Outliers 

Missing values are not the only issue you'll face when using data. Outliers are another important topic.

## What is an outlier? 

The issue is that there is no simple, or even single, definition. Intuitively, it is an observation that should not have the value it has.

That's vague, so it is better to think of examples.

Some outliers can come from
* Measurement errors
* Data handling errors
* Abnormal situation during the observation (with a caveat)
* etc.

Often, outliers are not too worrisome. However, things can become messy very fast. Consider the following example: we want to find a relationship between height (in meters) and weight (in kg) for people with a regular BMI (neither very overweight nor very underweight).

This relation is approximately linear (with an error). The issue is that, for one observation, the person in charge of collecting the data recorded the height in centimeters.

In [None]:
# The following code will generate the example. 
#If you execute this notebook, please make sure you run this cell before trying to run the rest of the code
np.random.seed(11)
# We will build a simple linear model of weight as a function of the height. The issue is that the 7th observation was recorded in centimeters
a = 40
height = 1.7 + np.random.normal(0, 0.15, 50)
weight = a* height + np.random.normal(0, 10, 50)
height[6] = height[6]*100
data = pd.DataFrame({"Height" : height, "Weight" : weight})
data['sex'] = "M"

In [None]:
data.head(10)

In [None]:
# Run the regression
Y = data.Weight
X = data.Height
X = np.asarray(X).reshape(-1,1)
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n")

That's pretty bad...

We can confirm the result we would have if we had the correct value.

In [None]:
# Run the regression
Y = data.Weight
height2 = data.Height.copy()
height2[6] = height2[6]/100
X = height2
X = np.asarray(X).reshape(-1,1)
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n")

Surely we can do something to avoid the problem? Well yes, but before we do, we need to be very confident that what we observe is indeed an error and, if possible, try to fix it. Alternatively, if you observe something so unusual that you don't believe it is useful to consider it in your model, you may want to take it out or to "tone it down".

However, your model will then be blind to that kind of occurrence. Instead, you may want to use more advanced methods that account for such unusual situations in your prediction (for example, Hidden Markov Models, which we will not see in this course).
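
When the anomaly is clearly a unit error, as with the height recorded in centimeters, fixing it can be as simple as a rule-based correction. The sketch below works on a hypothetical series and assumes that nobody is taller than 3 m, so that any larger value must be in centimeters (the threshold is an assumption for illustration, not something taken from the dataset):

```python
import pandas as pd

heights = pd.Series([1.72, 1.65, 181.0, 1.90])

# Assumption: no human is taller than 3 m, so larger values were typed in cm
threshold_m = 3.0

# where() keeps the values for which the condition holds and replaces the others
fixed = heights.where(heights <= threshold_m, heights / 100)
print(fixed.tolist())
```

Such a rule only makes sense when you know why the value is wrong. Otherwise you are just rewriting data you don't like.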

## How to identify outliers?

Sometimes, you just don't have time to check each entry of the data to determine if it is an outlier. There are some techniques that allow you to narrow the scope of the search.

### The graphical one

Sometimes, you can look at the distribution of the features through a series of boxplots (remember the session about visualization?) to identify outliers.

In [None]:
import matplotlib.pyplot as plt
plt.boxplot(data.Height);

Clearly, we see that one of the values (the circle at the top of the plot) is very different from the others (much higher than the 75th percentile).

The issue is that we don't know which observation it is, and we need this information to treat it.
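
One way to recover the rows behind the circles is to reproduce the boxplot rule numerically. By default, matplotlib draws as individual points the values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a small hypothetical series; it is an illustration, not the method used in the rest of this session:

```python
import pandas as pd

heights = pd.Series([1.62, 1.70, 1.75, 1.81, 1.68, 1.73, 165.0, 1.78])

# Quartiles and interquartile range of the sample
q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1

# True for the observations that a default boxplot would draw as single points
beyond_whiskers = (heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)
print(heights[beyond_whiskers])
```

The printed series keeps its index, so you can see which row to inspect.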

### Statistics-based methods

Only numeric columns can be considered for statistics-based methods.

In [None]:
# Keep only columns that are numeric
data = data.select_dtypes(include=['number'])
data.head()

A first method could be to look at the different values of the column Height.

In [None]:
# Divide the Height into bins to spot the outliers more easily
data['Height'].value_counts(bins=10).sort_index()

A second method is to look individually at the *n* largest and smallest values (you'll have to select *n*). This is fairly easy to do with pandas and NumPy. Just use *argpartition()*, a function that rearranges the indices so that those of the *n* smallest values come first (with a negative argument, those of the largest values come last).

In [None]:
arr = np.asarray(data.Height)
ind_max = np.argpartition(arr, -3)[-3:] # We select the 3 largest values
print("Indices: " + str(ind_max) + "\n")
print("Values: " + str(arr[ind_max]) + "\n")

In [None]:
arr = np.asarray(data.Height)
ind_min = np.argpartition(arr, 3)[:3] # We select the 3 smallest values
print("Indices: " + str(ind_min) + "\n")
print("Values: " + str(arr[ind_min]) + "\n")
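
The same lookup can also be sketched with the pandas methods *nlargest()* and *nsmallest()*, which return the values together with their row labels. The series below is hypothetical:

```python
import pandas as pd

heights = pd.Series([1.62, 1.70, 165.0, 1.81, 1.68], name="Height")

# The 3 largest and 3 smallest values, each with the index of its row
largest = heights.nlargest(3)
smallest = heights.nsmallest(3)

print(largest)
print(smallest)
```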

But this is unsatisfactory: you still have to do a lot of manual work. If all the values you find are outliers, you need to check whether the next one is an outlier too.

You can use a more "automatic" way. However, it is predicated on a strong assumption about your features: namely that they are normally distributed.

The idea here is to compute the mean and standard deviation of the sample and to analyze everything that falls outside $[\bar{x} - 2 \hat{\sigma}; \bar{x} + 2 \hat{\sigma}]$. Note, however, that both $\bar{x}$ and $\hat{\sigma}$ are themselves affected by the outliers.

In [None]:
x_bar = np.mean(data.Height)
sigma_hat = np.std(data.Height)

In [None]:
x_bar, sigma_hat

In [None]:
outlier = (data.Height < x_bar - 2 * sigma_hat) | (data.Height > x_bar + 2 * sigma_hat)

In [None]:
outlier

In [None]:
data.loc[outlier]

As you can see, in this case, the procedure identified only the outlier. Be wary, however, of the skew that outliers can induce on the mean and on the observed standard deviation, lest you lose some good-quality data.

We can define a function so that we can more easily reuse the code created. 

In [None]:
# Define a function to find outliers in a numeric column of a data frame
def find_outliers(df, col, print_outputs = True):
    x_bar = np.mean(df[col])
    sigma_hat = np.std(df[col])
    outlier = (df[col] < x_bar - 2 * sigma_hat) | (df[col] > x_bar + 2 * sigma_hat)

    if print_outputs:
        if outlier.sum() > 0:
            print(f'For column {col} we have {outlier.sum()} outliers: \n{df[outlier][col]}')
    else:
        return outlier

In [None]:
find_outliers(data, 'Height')

## How to treat the outliers?

Once again, there is not one single solution: you'll probably have to experiment. There are two main ways to deal with outliers:
* delete them
* bring them to values that are not outliers

The first way is radical and may not be practical if your dataset is small or if you have many features (why throw away many useful values because one field is corrupted?).

We can negate the output of the procedure from the last step: *~outlier* is True for any non-outlier and False for the outliers. You can then index the dataset with it to keep only the non-outliers.

In [None]:
non_outliers = ~outlier
print("Before dropping the outliers, the dataset has " + str(len(data)) + " rows.")
new_data = data.loc[non_outliers].copy()
print("After dropping outliers, there remains " + str(len(new_data)) + " rows.")

Good!

We could also decide to bring the values that lie outside the band defined by $\bar{x} - 2 \hat{\sigma}$ and $\bar{x} + 2 \hat{\sigma}$ back to its limits. This is done relatively easily:

In [None]:
min_band = np.mean(data.Height) - 2 * np.std(data.Height)
max_band = np.mean(data.Height) + 2 * np.std(data.Height)
data.loc[data.Height < min_band, "Height"] = min_band
data.loc[data.Height > max_band, "Height"] = max_band

In [None]:
data.loc[6]

As you can see, this is not ideal: the person we registered as being 160 m tall is now "only" 49 m. It improved the matter, but still...

To solve this issue, you could, for example, set the values outside the band to the highest (or lowest) value inside the band. This can be done by combining the two approaches above and is left as an exercise (basically: drop the outliers, take the maximum and the minimum values of the feature, store them somewhere and then assign them to the rows outside the band instead of assigning min_band and max_band).
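
If you want to check your answer, here is one possible sketch of that combination. It works on a small hypothetical series rather than on the dataset above, and uses the pandas method *clip()* as a shortcut for the two assignments:

```python
import pandas as pd

heights = pd.Series([1.62, 1.70, 1.75, 1.81, 1.68, 1.73, 1.66, 1.78, 1.71, 165.0])

# Band of two standard deviations around the mean (ddof=0 matches np.std)
x_bar = heights.mean()
sigma_hat = heights.std(ddof=0)
lower, upper = x_bar - 2 * sigma_hat, x_bar + 2 * sigma_hat

# Values that lie inside the band, i.e. the non-outliers
inside = heights[(heights >= lower) & (heights <= upper)]

# Pull every value back to the most extreme values observed inside the band
clamped = heights.clip(lower=inside.min(), upper=inside.max())
print(clamped.tolist())
```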

# Duplicates 

The topic of duplicates is somewhat easier to manage.

The question is: why do you have duplicates in your data? Are they legitimate?

Everything depends on how you, or the person who retrieved the data, worked. Remember the class on SQL? Well, if your dataset has a primary key or something like it, you may be confident that duplicates are not wanted and you may get rid of them promptly.

On the other hand, if you're not sure whether the duplicates ought to be there, you can always perform a sensitivity analysis: decide whether or not to keep them and do your analysis. Once it's done, run the same pipeline again with the duplicates added or removed (the opposite of what you had chosen). If the conclusions or the performance of the models remain the same, no issue. If not, you really need to dig deeper to understand whether the duplicates are legitimate.
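
Before deciding, it often helps to look at the duplicated rows themselves. The sketch below uses a hypothetical dataframe in which `person_id` plays the role of a primary key (both the dataframe and the column name are assumptions for illustration):

```python
import pandas as pd

people = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "Height": [1.70, 1.82, 1.82, 1.65],
})

# keep=False flags every copy of a duplicated row, not only the repeats
all_copies = people[people.duplicated(keep=False)]
print(all_copies)

# With a key column, duplicates can be dropped on that column alone
deduplicated = people.drop_duplicates(subset=["person_id"])
print(deduplicated)
```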

## How to get rid of duplicates? 

The good news is that getting rid of duplicates is easy! You can use pandas' *drop_duplicates* function and be done with it. Let's work a little example.

In [None]:
# The following code will generate the example. 
#If you execute this notebook, please make sure you run this cell before trying to run the rest of the code
np.random.seed(11)
# We will build a simple linear model of weight as a function of the height. The issue is that some observations were imported twice
a = 40
height = 1.7 + np.random.normal(0, 0.15, 500)
errors = np.random.normal(0, 10, 500)
weight = a* height + errors
data = pd.DataFrame({"Height" : height, "Weight" : weight})
# Because it was in two databases, the data regarding obese people (the ones for which the error is the highest) 
# was imported twice:
heavy = data.loc[errors >= np.quantile(errors, 0.75)]
data = pd.concat([data, heavy])  # DataFrame.append was removed in pandas 2.0

In [None]:
Y = data.Weight
X = data.Height
X = np.asarray(X).reshape(-1,1)
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n")

We can drop the duplicates and run it again. We can do this in SQL or in pandas:

First SQL: (you will be using pandasql for assignment 1)

In [None]:
!pip install -U pandasql

In [None]:
from pandasql import sqldf
pysql = lambda q: sqldf(q, globals()) # define function to execute sql on dataframe

In [None]:
datasql = pysql(''' select distinct Height, Weight
              from data
              ''') # execute your SQL query, it'll return a pandas dataframe

In [None]:
Y = datasql.Weight
X = datasql.Height
X = np.asarray(X).reshape(-1,1)
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n")

Or with pandas

In [None]:
data = data.drop_duplicates()

In [None]:
Y = data.Weight
X = data.Height
X = np.asarray(X).reshape(-1,1)
regr = lm.LinearRegression()
res = regr.fit(X, Y)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n")

The gain is small because of the structure of the dataset and the error, but, with particularly noisy datasets and on certain algorithms (that we will see in the next lectures), the impact may be much larger.

# Formatting for analysis

The last topic for today is the broader topic of formats. Different types of algorithms and different problems require different data formats. For example, classification problems usually require the target to be a factor or a string. For time series, specific date formats must be assigned to the data for things to work.

## "Casting" of a column

In programming, the action of converting data from one type to another is usually referred to as "casting". For example, it is easy to cast an integer such as *12* into a string, in this case "12". Working the other way around is trickier.

In pandas (and with NumPy arrays in general), you can cast by using the method *astype(t)*, where *t* is a string representing the type you want to convert the series to.

Imagine you have a column in your dataframe representing whether a person loves hamburgers or not. It is currently coded as an integer: 1 means the person loves hamburgers and 0 means they don't.

In [None]:

loves_hamburgers = pd.DataFrame(np.random.binomial(1, 0.8, 100), columns=["loves_burgers"])
loves_hamburgers

Say we want to use an algorithm from a package and, upon reading the docs, we realize that the target column must be coded as a boolean (a True or False value). What to do then?

Easy, just use the *astype(t)* method mentioned before with the type *t* set to 'bool' (which stands for boolean).

In [None]:
loves_hamburgers.loves_burgers = loves_hamburgers.loves_burgers.astype('bool')
loves_hamburgers
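
Casting from strings back to numbers is where things can go wrong, because some strings are not numbers at all. A minimal sketch on hypothetical values, using *pd.to_numeric()* with `errors="coerce"` so that unparseable entries become NaN instead of raising an error:

```python
import pandas as pd

# Integer to string: always possible
as_string = pd.Series([12, 7]).astype(str)

# String to number: "n/a" cannot be parsed, so it is turned into NaN
raw = pd.Series(["12", "7", "n/a", "3.5"])
as_number = pd.to_numeric(raw, errors="coerce")

print(as_string.tolist())
print(as_number.tolist())
```

The NaN values created this way can then be handled like any other missing value.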

## Working with time series

Another issue arises when time is involved. In this case, you can use the *to_datetime()* function to make sure that the series is cast to dates and times (up to the nanosecond if this information is available in the original data).

Some things to keep in mind with dates and times:
* While the function is "smart" enough to recognize most time formats from a string, it may be necessary to help it by specifying the "format" argument to indicate where the day and the month are in the string.
* In the original dataset, you'll often have the date formatted either as a string or as an integer. The integer usually indicates the number of seconds that have passed since the 1st of January 1970 (known as the UNIX epoch). If you see it in your data, it is not a mistake, just an alternative way to store time.
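
For the integer case, *to_datetime()* accepts a `unit` argument that tells it how to read the numbers. A minimal sketch, assuming the integers are seconds since the UNIX epoch (the values are hypothetical):

```python
import pandas as pd

# Seconds elapsed since 1970-01-01 00:00:00
epoch_seconds = pd.Series([0, 86_400, 1_000_000_000])

# unit="s" states that the integers are seconds, not the default nanoseconds
as_dates = pd.to_datetime(epoch_seconds, unit="s")
print(as_dates)
```

If your data stores milliseconds instead, the same call with `unit="ms"` applies.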

Let's try this: we'll generate a bunch of dates and try to cast them into real datetime data.

In [None]:
days = np.random.randint(1, 28+1, 100)
months = np.random.randint(1, 12+1, 100)
years = np.random.randint(2000, 2019+1, 100)
strings = np.array([str(days[ii])+"/"+str(months[ii])+"/"+str(years[ii]) for ii in range(0,100)])

In [None]:
strings

In [None]:
data_dates = pd.DataFrame()
data_dates["dates"] = pd.to_datetime(strings)

In [None]:
data_dates

But wait! There's an issue: when the day is 12 or less, the function erroneously assumes that it is the month. To get around it, just specify that, when in doubt, the day comes first.

In [None]:
data_dates.dates = pd.to_datetime(strings, dayfirst= True)
data_dates
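
An alternative to `dayfirst` is to spell out the layout with the `format` argument, which leaves no room for guessing. A minimal sketch on a few hypothetical day/month/year strings:

```python
import pandas as pd

date_strings = ["5/3/2010", "25/12/2004", "1/11/2018"]

# %d = day, %m = month, %Y = four-digit year, separated by slashes
parsed = pd.to_datetime(date_strings, format="%d/%m/%Y")
print(parsed)
```

With an explicit format, the first string is read as the 5th of March and not the 3rd of May.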

Sometimes, it is important to retrieve information that derives from the date. For example, it is sometimes important to know which day of the week it was (imagine building a model where we forecast attendance to an amusement park, it is more likely that it will be high on the weekends).

More generally, you may want to break a date into its components. If you want to do so, you can use the following functions:

In [None]:
data_dates['WeekDay'] = data_dates.dates.dt.day_name()
data_dates['WeekDayNum'] = data_dates.dates.dt.dayofweek
data_dates['Day'] = data_dates.dates.dt.day
data_dates['Month'] = data_dates.dates.dt.month
data_dates['Year'] = data_dates.dates.dt.year
data_dates['Hour'] = data_dates.dates.dt.hour

In [None]:
data_dates

Sometimes you also have more specific formats, for instance only the month and the year, separated by a forward slash as in the example below:

In [None]:
all_months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
months = np.random.choice(all_months, 100)
years = np.random.randint(2000, 2021, 100)
strings = np.array([str(months[ii])+'/'+str(years[ii]) for ii in range(0,100)])
strings

In those cases you can also specify the format of the dates. The available format codes are listed [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

In [None]:
data_dates.dates = pd.to_datetime(strings, format='%b/%Y')
data_dates

# Your turn 

You are working with a friend on an assignment in which you use data about age, wealth and whether or not a person drives an expensive car to predict the number of ties that person possesses. Your friend sent you the dirty dataset in the file "dirty_set.csv". Clean it up and estimate the coefficients of the linear regression.

For the NaNs, substitute them with the mean of their series for continuous variables and the mode of the series for the categorical or boolean ones.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

In [None]:
df = pd.read_csv("/gdrive/MyDrive/STATS406/STATS406 TP5 - dirty_set.csv")

In [None]:
# your cleaning code comes here

In [None]:
from datetime import datetime
ages_in_ns = (datetime.today() - df.date_of_birth) # make sure that the formatted column for the date of birth is called "date_of_birth"
tot_sec_year = 60 * 60 * 24 * 365
years = [ (ii.total_seconds()/tot_sec_year) for ii in ages_in_ns]
df['age'] = years
regr = lm.LinearRegression()
res = regr.fit(df.loc[:,["wealth", "age", "owns_expensive_car"]], df.nb_ties)
print("Estimated coefficients are\n" + 
      "a_hat = " + str(res.coef_[0]) + "\n" +
     "w_hat = " + str(res.coef_[1]) + "\n" +
     "o_hat = " + str(res.coef_[2]))