# Analysis of Image Sizes data

Check the consistency and accuracy of the generated `SIZE` column.

## Import necessary libraries

In [72]:
import re

import pandas as pd

## Load data from Excel file

In [78]:
df = pd.read_excel("../data/ImageSizeExample.xlsx")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46888 entries, 0 to 46887
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   image_url  46868 non-null  object
 1   SIZE       46888 non-null  object
dtypes: object(2)
memory usage: 732.8+ KB


In [79]:
sizes = df["SIZE"]

sizes

0          575x860
1        1080x1080
2        1080x1614
3        1080x1080
4        1080x1614
           ...    
46883    2301x1080
46884    1621x1080
46885    1080x1218
46886    1080x1080
46887    1757x1080
Name: SIZE, Length: 46888, dtype: object

## Check valid sizes amount

In [80]:
size_format = re.compile(r"^\d+x\d+$")

valid_sizes_mask = sizes.apply(lambda x: bool(size_format.match(str(x))))

valid_sizes_amount = valid_sizes_mask.sum()

valid_sizes_amount

45872

In [93]:
valid_sizes_amount / sizes.size * 100

97.83313427742706

Valid sizes make up **97.83%** of the DataFrame.


Looks good!
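The same format check can also be written without a Python-level `lambda`. The sketch below is an alternative, not the method used in this notebook: it assumes pandas 1.1 or newer (for `Series.str.fullmatch`), and `sample_sizes` is a hypothetical stand-in for the real `SIZE` column.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values seen in the SIZE column
sample_sizes = pd.Series(
    ["575x860", "1080x1080", "ImageNotFound", "UrlNotProvided", "1080x"]
)

# fullmatch requires the whole string to match, so ^ and $ are not needed
sample_valid_mask = sample_sizes.astype(str).str.fullmatch(r"\d+x\d+")

sample_valid_count = int(sample_valid_mask.sum())
sample_valid_share = sample_valid_count / sample_sizes.size * 100
```

On the sample, only the two `WIDTHxHEIGHT` strings pass; the truncated `1080x` is rejected along with the two marker values.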

## Check invalid sizes amount

In [81]:
invalid_sizes = sizes[~valid_sizes_mask]
invalid_sizes_amount = invalid_sizes.size

invalid_sizes_amount

1016

In [94]:
invalid_sizes_amount / sizes.size * 100

2.1668657225729397

Invalid sizes make up **2.17%** of the DataFrame.

### Check invalid sizes values

In [103]:
invalid_sizes.unique()

array(['ImageNotFound', 'UrlNotProvided'], dtype=object)

The invalid sizes take only two values: `ImageNotFound` and `UrlNotProvided`.

### Check ImageNotFound values amount (404 status code when retrieving image)

In [104]:
image_not_found_amount = invalid_sizes[invalid_sizes == "ImageNotFound"].count()

image_not_found_amount

996

In [105]:
image_not_found_amount / sizes.size * 100

2.1242108855144175

`ImageNotFound` rows make up **2.12%** of the DataFrame.

### Check UrlNotProvided values amount (no URLs in corresponding rows)

In [106]:
url_not_provided_amount = invalid_sizes[invalid_sizes == "UrlNotProvided"].count()

url_not_provided_amount

20

In [107]:
url_not_provided_amount / sizes.size * 100

0.04265483705852244

Only a very small share (**0.04%**) of rows have no image URL.
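The `df.info()` output reports 46868 non-null `image_url` values out of 46888 rows, so 20 URLs are missing, which matches the `UrlNotProvided` count. Equal counts do not prove that the marker sits on the same rows, though. The sketch below shows one way to check row-level agreement; `sample_df` is a hypothetical stand-in for `df`, and the example URLs are placeholders.

```python
import pandas as pd

# Hypothetical sample with the same two columns as the loaded DataFrame
sample_df = pd.DataFrame(
    {
        "image_url": ["http://example.com/a.jpg", None, "http://example.com/b.jpg", None],
        "SIZE": ["575x860", "UrlNotProvided", "ImageNotFound", "UrlNotProvided"],
    }
)

missing_url_mask = sample_df["image_url"].isna()
url_not_provided_mask = sample_df["SIZE"] == "UrlNotProvided"

# True only if the marker appears on exactly the rows that have no URL
markers_match_missing_urls = bool((missing_url_mask == url_not_provided_mask).all())
```

Running the same comparison on `df` would confirm whether every `UrlNotProvided` row is one with a missing `image_url`, and vice versa.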

## Results:

**97.83% - Valid image sizes**

2.17% - Invalid image sizes, of which:

* **~2.12% ImageNotFound**
* **~0.04% UrlNotProvided**
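The whole breakdown can also be produced in a single pass. This is a minimal sketch on hypothetical data, not the computation used in the cells above: valid strings are relabelled `"Valid"` (a label introduced here for illustration) and the shares come from `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical sample standing in for the SIZE column
sample_sizes = pd.Series(
    ["575x860", "1080x1080", "1080x1614", "ImageNotFound", "UrlNotProvided"]
)

is_valid = sample_sizes.astype(str).str.match(r"^\d+x\d+$")

# Keep the marker text for invalid rows and relabel every valid row as "Valid"
categories = sample_sizes.where(~is_valid, "Valid")

# Share of each category in percent, rounded to two decimals
summary = categories.value_counts(normalize=True).mul(100).round(2)
```

Applied to the real column, `summary` would hold one percentage per category, which makes it easy to confirm that the parts add up to 100%.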