# Data Extraction and Regression
This notebook contains the code for generating the data set used in the regression analysis, as well as the analysis itself. It also includes instructions on how to reproduce the study if you wish to test it with your own pictures.

## Step 0: Give Amazon SageMaker the required access to services (skip if already done)
This analysis uses Amazon S3 and Amazon Rekognition, so we need to give Amazon SageMaker access to these services through Identity and Access Management (IAM).

First, open IAM from your Console and click `Roles` on the left panel. Next, find the `AmazonSageMaker-ExecutionRole-xxxxxxxx` role (the last few digits differ for everyone). You should now see a page like this:

![roleaccess](./step0-2.jpg)

The policies we need are `AmazonRekognitionFullAccess` and `AmazonS3FullAccess`, so click `Attach policies` and add them.
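If you prefer to attach the policies from code rather than the console, the sketch below shows one way to do it. It assumes you run it with credentials that are allowed to modify IAM roles (the SageMaker execution role itself may not be), and the helper name `attach_policies` is ours, not part of boto3. The role name in the usage comment keeps the `xxxxxxxx` placeholder, so replace it with your own suffix.

```python
# ARNs of the two AWS-managed policies named above.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonRekognitionFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]


def attach_policies(iam_client, role_name, policy_arns=POLICY_ARNS):
    """Attach each managed policy to the role and return the ARNs that were sent."""
    attached = []
    for arn in policy_arns:
        # attach_role_policy is the boto3 IAM call behind the console's "Attach policies" button
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
        attached.append(arn)
    return attached


# Usage (placeholder role name, replace the x's with your own suffix):
#   import boto3
#   attach_policies(boto3.client("iam"), "AmazonSageMaker-ExecutionRole-xxxxxxxx")
```

Passing the client in as an argument keeps the helper easy to try out with a stub before touching a real account.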

## Step 1: Prepare the data for analysis
You can create the buckets that store the images with code, but it is easier to use the graphical UI for this.

Open S3 from your console and click `Create bucket`. We need two buckets: one for the source image(s) and one for the target images. There isn't much to configure, but bucket names must be globally unique, so add a name or number at the end of each bucket name. For our project, we name the buckets `source-image-marinara` and `target-image-marinara`. You should see two new buckets when you finish creating them.

![newbuckets](./step1-3.jpg)
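For readers who would rather create the buckets with code, here is a minimal sketch using the boto3 S3 client. The helper name `create_bucket` and the default region are assumptions for illustration. The one subtlety it handles is that `us-east-1` must be created without a `LocationConstraint`, while other regions require one.

```python
def create_bucket(s3_client, name, region="us-east-1"):
    """Create one bucket and return the arguments that were sent to S3."""
    kwargs = {"Bucket": name}
    if region != "us-east-1":
        # Every region except us-east-1 needs an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**kwargs)
    return kwargs


# Usage (bucket names from this project; yours must be globally unique):
#   import boto3
#   s3 = boto3.client("s3")
#   for bucket in ("source-image-marinara", "target-image-marinara"):
#       create_bucket(s3, bucket)
#   # upload_file(Filename, Bucket, Key) copies a local file into a bucket
#   s3.upload_file("FL1.jpg", "target-image-marinara", "FL1.jpg")
```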

You can now click into the buckets and start uploading the images. For our project, we collected samples from 4 people, each with 10 pictures edited through Meitu, an app with a portrait-beautifying feature.

![imagedata](./step1-4.jpg)

Naturally, we have 4 different source images, one per person. If you want to reproduce the test, we recommend labeling your images with numbers.

Next, open an Excel file and enter the beautifying parameters for each picture. These data have to be recorded manually because the pictures do not contain metadata for the beautifying effects. Save the sheet as a CSV file (we use `data_sheet.csv`) so that it can be read with pandas later.

![datasheet](./step1-5.jpg)

Later, we will merge this data set with the similarity data that we will obtain from Rekognition.


## Step 2: Generate the data frame
First, we write a function that extracts the similarity statistic through Amazon Rekognition.

In [12]:
import boto3
import numpy as np
import pandas as pd

In [3]:
client = boto3.client('rekognition')

In [4]:
#check buckets
!aws s3 ls

2021-03-30 13:43:49 my-bucket-francis
2021-04-06 13:36:45 project-francis
2021-04-22 07:11:48 source-image-marinara
2021-04-22 07:15:54 target-image-marinara


In [5]:
# Because we have 4 source images, the simplest solution is to write one extraction function per source image.
# Each function returns NaN if the call fails or Rekognition reports no face match.
def extract_similarity_FL(photo):
    try:
        comparison = client.compare_faces(
            SourceImage= {'S3Object':{'Bucket':'source-image-marinara', 'Name':'FL-source.jpg'}},
            TargetImage = {'S3Object':{'Bucket':'target-image-marinara','Name':photo}})
        similarity = comparison['FaceMatches'][0]['Similarity']
    except Exception:
        similarity = np.nan
    return similarity

def extract_similarity_Jenny(photo):
    try:
        comparison = client.compare_faces(
            SourceImage= {'S3Object':{'Bucket':'source-image-marinara', 'Name':'Jenny-source.jpg'}},
            TargetImage = {'S3Object':{'Bucket':'target-image-marinara','Name':photo}})
        similarity = comparison['FaceMatches'][0]['Similarity']
    except Exception:
        similarity = np.nan
    return similarity

def extract_similarity_Jin(photo):
    try:
        comparison = client.compare_faces(
            SourceImage= {'S3Object':{'Bucket':'source-image-marinara', 'Name':'Jin-source.jpg'}},
            TargetImage = {'S3Object':{'Bucket':'target-image-marinara','Name':photo}})
        similarity = comparison['FaceMatches'][0]['Similarity']
    except Exception:
        similarity = np.nan
    return similarity

def extract_similarity_Nicole(photo):
    try:
        comparison = client.compare_faces(
            SourceImage= {'S3Object':{'Bucket':'source-image-marinara', 'Name':'Nicole-source.jpg'}},
            TargetImage = {'S3Object':{'Bucket':'target-image-marinara','Name':photo}})
        similarity = comparison['FaceMatches'][0]['Similarity']
    except Exception:
        similarity = np.nan
    return similarity
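The four functions above differ only in the source image name, so they can be folded into one. The sketch below is an alternative, not the code that produced the results in this notebook. The names `extract_similarity` and `source_for` are hypothetical, and `source_for` assumes the naming convention used here (`FL10.jpg` maps to `FL-source.jpg`). It also passes `SimilarityThreshold=0`. To our understanding, `compare_faces` otherwise applies a default threshold of 80, so a heavily edited picture scoring below that would have no entry in `FaceMatches` and would come back as NaN.

```python
import math


def extract_similarity(client, source_key, target_key,
                       source_bucket="source-image-marinara",
                       target_bucket="target-image-marinara",
                       threshold=0.0):
    """Return the highest similarity among the face matches, or NaN if there is none."""
    try:
        response = client.compare_faces(
            SourceImage={"S3Object": {"Bucket": source_bucket, "Name": source_key}},
            TargetImage={"S3Object": {"Bucket": target_bucket, "Name": target_key}},
            SimilarityThreshold=threshold,
        )
    except Exception:
        # Same broad fallback as the original functions; note that it also hides
        # permission errors and misspelled object names.
        return math.nan
    matches = response.get("FaceMatches", [])
    if not matches:
        return math.nan
    return max(match["Similarity"] for match in matches)


def source_for(target_key):
    """Assumed naming rule: strip the extension and trailing digits, add '-source.jpg'."""
    prefix = target_key.split(".")[0].rstrip("0123456789")
    return f"{prefix}-source.jpg"


# Usage:
#   import boto3
#   client = boto3.client("rekognition")
#   extract_similarity(client, source_for("FL1.jpg"), "FL1.jpg")
```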

In [6]:
#test
extract_similarity_FL('FL1.jpg')

99.99996185302734

In [7]:
extract_similarity_Jenny('Jenny1.JPG')

99.99998474121094

In [8]:
extract_similarity_Jin('Jin1.JPG')

99.97551727294922

In [9]:
extract_similarity_Nicole('Nicole4.JPG')

99.99911499023438

Now we generate the data frame.

In [10]:
s3_resource = boto3.resource('s3')
my_bucket = s3_resource.Bucket('target-image-marinara')
summaries = my_bucket.objects.all()
image_names = [image.key for image in summaries]
image_names

['FL1.jpg',
 'FL10.jpg',
 'FL2.jpg',
 'FL3.jpg',
 'FL4.jpg',
 'FL5.jpg',
 'FL6.jpg',
 'FL7.jpg',
 'FL8.jpg',
 'FL9.jpg',
 'Jenny1.JPG',
 'Jenny10.JPG',
 'Jenny2.JPG',
 'Jenny3.JPG',
 'Jenny4.JPG',
 'Jenny5.JPG',
 'Jenny6.JPG',
 'Jenny7.JPG',
 'Jenny8.JPG',
 'Jenny9.JPG',
 'Jin1.JPG',
 'Jin10.JPG',
 'Jin2.JPG',
 'Jin3.JPG',
 'Jin4.JPG',
 'Jin5.JPG',
 'Jin6.JPG',
 'Jin7.JPG',
 'Jin8.JPG',
 'Jin9.JPG',
 'Nicole1.JPG',
 'Nicole10.JPG',
 'Nicole2.JPG',
 'Nicole3.JPG',
 'Nicole4.JPG',
 'Nicole5.JPG',
 'Nicole6.JPG',
 'Nicole7.JPG',
 'Nicole8.JPG',
 'Nicole9.JPG']

As we can see, S3 lists the object names in lexicographic order (for example, `FL10.jpg` comes before `FL2.jpg`), which differs from the order of our manually typed data sheet, so we need to adjust accordingly.
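One way to make the adjustment in code, instead of reordering the sheet by hand, is to sort the file names with a "natural" key that compares digit runs as numbers. This is a small sketch of that idea; `natural_key` is a hypothetical helper and the list is a shortened copy of the names above.

```python
import re


def natural_key(name):
    """Split 'FL10.jpg' into ['fl', 10, '.jpg'] so that digit runs compare as numbers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["FL1.jpg", "FL10.jpg", "FL2.jpg", "Jenny1.JPG", "Jenny10.JPG", "Jenny2.JPG"]
print(sorted(names, key=natural_key))
# ['FL1.jpg', 'FL2.jpg', 'FL10.jpg', 'Jenny1.JPG', 'Jenny2.JPG', 'Jenny10.JPG']
```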

In [86]:
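# Start from an empty list ([]*40 is still just []); each loop appends 10 values, 40 in total.
similarity = []

for i in range(0,10,1):
    similarity.append(extract_similarity_FL(image_names[i]))

similarity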

[99.99996185302734, 99.99999237060547, 99.99917602539062, 99.9999771118164, 99.99993133544922, 99.99998474121094, 99.99998474121094, 99.99993896484375, 99.99996948242188, 99.99993896484375]


In [88]:
for i in range(10,20,1):
    similarity.append(extract_similarity_Jenny(image_names[i]))

In [92]:
for i in range(20,30,1):
    similarity.append(extract_similarity_Jin(image_names[i]))

In [96]:
for i in range(30,40,1):
    similarity.append(extract_similarity_Nicole(image_names[i]))

In [109]:
len(similarity)

40

In [110]:
df = pd.read_csv('data_sheet.csv')

In [111]:
df['similarity']=similarity
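Assigning the list as a column relies on the data sheet rows being in exactly the same order as `image_names`. A more defensive option is to join on the file name. The sketch below illustrates this with three rows whose values are copied from the data frame printed in this notebook. The frame names `sheet` and `scores` are assumptions, and `validate="one_to_one"` makes pandas raise an error if a name appears more than once on either side.

```python
import pandas as pd

# Three rows standing in for the data sheet, deliberately in "human" order.
sheet = pd.DataFrame({"Name": ["FL1.jpg", "FL2.jpg", "FL10.jpg"],
                      "forehead": [16, -50, -16]})

# Similarity scores in the order S3 lists the objects.
scores = pd.DataFrame({"Name": ["FL1.jpg", "FL10.jpg", "FL2.jpg"],
                       "similarity": [99.999962, 99.999992, 99.999176]})

# A left join keeps the sheet's row order and matches scores by file name.
merged = sheet.merge(scores, on="Name", how="left", validate="one_to_one")
print(merged)
```

With this approach a missing or misspelled file name shows up as a NaN in `similarity` rather than silently shifting every later row.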

In [112]:
df

Unnamed: 0,Name,Pic#,forehead,chin,eyes,cheek,male,similarity
0,FL1.jpg,1,16,-33,49,26,1,99.999962
1,FL10.jpg,10,-16,19,83,57,1,99.999992
2,FL2.jpg,2,-50,38,100,100,1,99.999176
3,FL3.jpg,3,32,-24,71,39,1,99.999977
4,FL4.jpg,4,50,19,15,70,1,99.999931
5,FL5.jpg,5,-31,24,34,43,1,99.999985
6,FL6.jpg,6,22,-41,69,17,1,99.999985
7,FL7.jpg,7,8,-43,10,93,1,99.999939
8,FL8.jpg,8,-22,44,78,64,1,99.999969
9,FL9.jpg,9,43,-36,8,88,1,99.999939


In [115]:
# Spot-check that the values in the data frame are correct

extract_similarity_Jenny('Jenny10.JPG')

99.99986267089844

## Step 3: Run multiple linear regression
This is the multiple linear regression model that we are trying to estimate:

$$\text{similarity}=\beta_0+\beta_1\,\text{forehead}+\beta_2\,\text{chin}+\beta_3\,\text{eyes}+\beta_4\,\text{cheek}+\beta_5\,\text{male}+u$$
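For reference, the same model can be written in matrix form. Here $y$ is the $40\times 1$ vector of similarity scores and $X$ is the $40\times 6$ matrix holding a column of ones and the five regressors (notation introduced here for illustration). Ordinary least squares then picks the coefficient vector that minimizes the sum of squared residuals, which has the standard closed form below, provided $X^{\top}X$ is invertible:

$$\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y$$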

In [116]:
import statsmodels.formula.api as smf


In [117]:
results = smf.ols('similarity ~ forehead+chin+eyes+cheek+male', data=df).fit()

In [119]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:             similarity   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.199
Method:                 Least Squares   F-statistic:                     2.933
Date:                Thu, 22 Apr 2021   Prob (F-statistic):             0.0263
Time:                        12:48:12   Log-Likelihood:                 83.810
No. Observations:                  40   AIC:                            -155.6
Df Residuals:                      34   BIC:                            -145.5
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    100.0280      0.014   7261.956      0.0

Printing the regression summary gives us a lot of information. At first glance, all regressors have negative effects on similarity, which is in line with our expectations. However, among the slope estimates, only `forehead` is statistically significant at the 5% level, with a p-value of 0.034. The 95% confidence intervals are presented next to the p-values, and the only interval that excludes zero is the one for `forehead`. While the other 4 variables are not individually significant, the F-statistic indicates that the regressors are jointly significant: the F-test rejects the null hypothesis that all slope coefficients are zero at the 5% level (p = 0.0263). Finally, the regressors explain little of the variation in similarity, because the adjusted R-squared is only about 0.20.
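As a sanity check on the summary, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared, the number of observations (40) and the number of regressors (5) using the textbook formulas. The function names below are ours. Because the inputs are the rounded values from the printed table, the results agree with the summary only up to rounding.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F-statistic for the joint null that all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


r2, n, k = 0.301, 40, 5  # values read off the regression summary
print(round(adjusted_r2(r2, n, k), 3))  # 0.198, versus 0.199 in the summary
print(round(f_statistic(r2, n, k), 2))  # 2.93, versus 2.933 in the summary
```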

Interestingly, holding other factors fixed, the estimated difference in similarity between male and female subjects is -0.0169, although this coefficient is not statistically significant.

To sum up, we find that only the beautifying effect on the forehead has a statistically significant effect on similarity. For a 1-unit increase in the forehead beautifying parameter, the similarity index decreases by 0.0003, holding the other factors fixed.
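To get a feel for the size of this estimate, consider a hypothetical 50-unit increase in the forehead parameter (the value 50 is chosen only for illustration), with the other regressors held fixed:

$$\Delta\widehat{\text{similarity}}=\hat{\beta}_1\,\Delta\text{forehead}\approx-0.0003\times 50=-0.015$$

So even a large change in the forehead setting is predicted to move the similarity index by only a few hundredths of a point.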