# Sprint 04. Task 01
## By José Manuel Castaño

## - Exercise 1

Standardize, identify, and enumerate each of the attributes / variables in the structure of the file "Web_access_log-akumenius.com", which you will find in the GitHub repository "Data-sources".

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Preparing the import:
- Build the separator for the log fields with a regex
- Define the column names
- Strip the enclosing [] and "" from the fields that have them
- Convert some numeric values to integers
- Replace - with NaN

In [17]:
separador = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'
nombre_columnas = ['remote_host', 'IP_client', 'user_identity', 'user_name', 'data_time', 'request', 'status', 'size', 'referer', 'user_agent', 'unknown']

# Helper that strips the first and last character of a field (the enclosing [] or "")
def eliminar_extrems(x):
    return x[1:-1]            
convertidor = {'data_time':eliminar_extrems, 'request': eliminar_extrems, 'user_agent':eliminar_extrems, 'referer':eliminar_extrems, 'status':int, 'size':int}

pd.set_option('display.max_colwidth', None)
logs = pd.read_csv('Web_access_log-akumenius.com.txt', sep = separador, names = nombre_columnas, converters=convertidor, na_values='-', engine='python')
logs.head(60)

Unnamed: 0,remote_host,IP_client,user_identity,user_name,data_time,request,status,size,referer,user_agent,unknown
0,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
1,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
2,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
3,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
4,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
5,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
6,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
7,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
8,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
9,localhost,127.0.0.1,,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection),VLOG=-
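
To see what the separator regex does, here is a minimal sketch that applies it to a single line with `re.split`. The raw line is an assumption: it is rebuilt from the first row of the table above, not copied from the file. The pattern splits on whitespace only when that whitespace is outside double quotes and outside square brackets.

```python
import re

# Same pattern as in the import cell above.
separador = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

# Assumed raw line, rebuilt from the first row of the table (not copied from the file).
line = ('localhost 127.0.0.1 - - [23/Feb/2014:03:10:31 +0100] '
        '"OPTIONS * HTTP/1.0" 200 - "-" '
        '"Apache (internal dummy connection)" VLOG=-')

# Spaces inside [...] and "..." are not treated as separators.
fields = re.split(separador, line)
for position, field in enumerate(fields):
    print(position, field)
```

The split should yield 11 fields, one per name in `nombre_columnas`, with the timestamp, request and user agent each kept whole.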


## - Exercise 2

Clean, preprocess, structure, and transform (into a dataframe) the web access log data.

We inspect the contents of each field to detect fields that can be dropped.

In [3]:
logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261873 entries, 0 to 261872
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   remote_host    261873 non-null  object 
 1   IP_client      261873 non-null  object 
 2   user_identity  0 non-null       float64
 3   user_name      27 non-null      object 
 4   data_time      261873 non-null  object 
 5   request        261836 non-null  object 
 6   status         261873 non-null  int64  
 7   size           219538 non-null  object 
 8   referer        162326 non-null  object 
 9   user_agent     261654 non-null  object 
 10  unknown        261873 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 22.0+ MB


In [4]:
logs.isnull().sum()

remote_host           0
IP_client             0
user_identity    261873
user_name        261846
data_time             0
request              37
status                0
size              42335
referer           99547
user_agent          219
unknown               0
dtype: int64

In [5]:
logs['unknown'].value_counts()

VLOG=-    261873
Name: unknown, dtype: int64

We see that the **user_identity** field is entirely null and that every record has the same value in the **unknown** field.   
We therefore drop both.

In [18]:
logs.drop(['user_identity', 'unknown'], axis=1, inplace=True)
logs.head()

Unnamed: 0,remote_host,IP_client,user_name,data_time,request,status,size,referer,user_agent
0,localhost,127.0.0.1,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
1,localhost,127.0.0.1,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
2,localhost,127.0.0.1,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
3,localhost,127.0.0.1,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
4,localhost,127.0.0.1,,23/Feb/2014:03:10:31 +0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
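
The same check can be done programmatically instead of by reading `info()` and `value_counts()`. This is a minimal sketch on a made-up three-row frame. The helper name `uninformative_columns` is hypothetical, and the rule it applies (at most one distinct non-null value) is an assumption about what counts as droppable.

```python
import pandas as pd

def uninformative_columns(df):
    """Return the columns that are entirely null or hold one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Made-up sample that mimics the shape of the problem, not the real log.
sample = pd.DataFrame({
    'IP_client': ['127.0.0.1', '66.249.76.216', '80.28.221.123'],
    'user_identity': [float('nan'), float('nan'), float('nan')],
    'unknown': ['VLOG=-', 'VLOG=-', 'VLOG=-'],
})

print(uninformative_columns(sample))
```

Note that this rule also flags a column with one distinct value plus some nulls, so the result is best treated as a list of candidates to review.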


We split the **data_time** field into 3 new fields: **date, time and time_zone**.     
We then drop the **data_time** field.

In [19]:
logs.insert(4, 'date', logs['data_time'].str.split(':', n=1).str[0])   # text before the first ':'
logs.insert(5, 'time', logs['data_time'].str.slice(-14, -6))           # stop before the space that precedes the zone
logs.insert(6, 'time_zone', logs['data_time'].str.slice(-5))
logs.drop(['data_time'], axis=1, inplace=True)               # Drop the original data_time field
logs.head()

Unnamed: 0,remote_host,IP_client,user_name,date,time,time_zone,request,status,size,referer,user_agent
0,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
1,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
2,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
3,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
4,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS * HTTP/1.0,200,,,Apache (internal dummy connection)
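
As an alternative to slicing by character position, the timestamp can be parsed once and the three parts derived from the parsed value. This is a minimal sketch with the standard library on a single string. It assumes month abbreviations are in English (the default C locale), which `%b` depends on.

```python
from datetime import datetime

raw = '23/Feb/2014:03:10:31 +0100'

# %b expects English month abbreviations; %z reads the +0100 offset.
parsed = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')

date = parsed.strftime('%d/%b/%Y')
time = parsed.strftime('%H:%M:%S')
time_zone = parsed.strftime('%z')
print(date, time, time_zone)
```

Parsing also validates the value: a malformed timestamp raises `ValueError`, whereas positional slicing would silently return the wrong substring.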


We split the **request** field into 3 fields: **type_request, request and protocol**

In [20]:
logs.insert(6,'type_request',logs['request'].str.split(' ').str[0])
logs.insert(8,'protocol',logs['request'].str.split(' ').str[2])
logs['request']=logs['request'].str.split(' ').str[1]
logs.head()

Unnamed: 0,remote_host,IP_client,user_name,date,time,time_zone,type_request,request,protocol,status,size,referer,user_agent
0,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS,*,HTTP/1.0,200,,,Apache (internal dummy connection)
1,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS,*,HTTP/1.0,200,,,Apache (internal dummy connection)
2,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS,*,HTTP/1.0,200,,,Apache (internal dummy connection)
3,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS,*,HTTP/1.0,200,,,Apache (internal dummy connection)
4,localhost,127.0.0.1,,23/Feb/2014,03:10:31,+0100,OPTIONS,*,HTTP/1.0,200,,,Apache (internal dummy connection)
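
Some requests are null (37 according to `isnull().sum()`), and a request line may not always have exactly three tokens. This is a minimal sketch of a tolerant splitter. The function name `split_request` and its padding rules are assumptions for illustration, not part of the notebook.

```python
def split_request(request):
    """Split 'METHOD path PROTOCOL' into three parts, padding with None."""
    if not isinstance(request, str) or not request.strip():
        # NaN, None or empty values carry no usable parts.
        return (None, None, None)
    parts = request.split(' ')
    if len(parts) >= 3:
        # Anything between the first and last token is treated as the path.
        return (parts[0], ' '.join(parts[1:-1]), parts[-1])
    # Fewer than three tokens: keep what exists and pad the rest.
    return tuple(parts + [None] * (3 - len(parts)))

print(split_request('OPTIONS * HTTP/1.0'))
print(split_request(float('nan')))
print(split_request('GET /index.html'))
```

Returning a fixed-length tuple keeps the three target columns aligned even when a request is malformed.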


## - Exercise 3

Geolocate the IPs.

To geolocate we use the ip2geotools module.  
We first count how often each IP appears. Depending on how many distinct IPs there are, we set a criterion to geolocate only the most frequent ones, since geolocation is slow.

In [21]:
from ip2geotools.databases.noncommercial import DbIpCity

In [22]:
logs['IP_client'].value_counts()

66.249.76.216      46382
80.28.221.123      14725
127.0.0.1          13892
217.125.71.222      5201
66.249.75.148       3558
                   ...  
84.123.150.27          1
217.130.150.116        1
202.46.52.23           1
216.151.130.170        1
206.198.5.33           1
Name: IP_client, Length: 2921, dtype: int64

We create a new dataframe, **IP_procedencia**, that holds the IPs, their visit counts and the geolocation data

In [23]:
IP_procedencia=logs['IP_client'].value_counts().rename_axis('IP').reset_index(name='visits')
IP_procedencia

Unnamed: 0,IP,visits
0,66.249.76.216,46382
1,80.28.221.123,14725
2,127.0.0.1,13892
3,217.125.71.222,5201
4,66.249.75.148,3558
...,...,...
2916,84.123.150.27,1
2917,217.130.150.116,1
2918,202.46.52.23,1
2919,216.151.130.170,1


In [24]:
IP_procedencia.describe().round(2)

Unnamed: 0,visits
count,2921.0
mean,89.65
std,949.37
min,1.0
25%,2.0
50%,31.0
75%,81.0
max,46382.0


In total we have 2921 unique IPs. The mean is 89.65 visits per IP and the third quartile is 81, meaning that 75% of the IPs made 81 visits or fewer. The maximum of 46382 shows that traffic is heavily concentrated in a few addresses.
Because geolocation is slow, we keep only the IPs with the most visits (roughly the top 80) and geolocate just those.
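
The 75% row of `describe()` is a quartile of the per-IP visit counts. It does not say how many IPs produce 75% of the visits, which is a different quantity obtained from a cumulative sum. This is a minimal sketch of the difference on made-up counts (not the real log), using only the standard library.

```python
from itertools import accumulate
from statistics import quantiles

# Made-up visit counts per IP, sorted from most to least visits.
visits = [500, 200, 100, 50, 50, 40, 30, 20, 5, 5]

# Third quartile of the counts: 75% of the IPs have this many visits or fewer.
q3 = quantiles(visits, n=4, method='inclusive')[2]

# Number of top IPs needed to reach 75% of all visits.
total = sum(visits)
running = accumulate(visits)
n_ips = next(i for i, seen in enumerate(running, start=1) if seen >= 0.75 * total)

print(q3, n_ips)
```

On the real data, the equivalent would be a cumulative sum over the `visits` column of `IP_procedencia`, which is already sorted in descending order by `value_counts()`.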

In [25]:
# Keep only the most-visited IPs: drop the rows from index 82 onward (rows 0 to 81 remain)
IP_procedencia.drop(range(82,2921), axis=0, inplace=True)
IP_procedencia

Unnamed: 0,IP,visits
0,66.249.76.216,46382
1,80.28.221.123,14725
2,127.0.0.1,13892
3,217.125.71.222,5201
4,66.249.75.148,3558
...,...,...
77,208.43.225.85,313
78,83.43.71.59,310
79,85.55.167.1,309
80,193.111.141.52,301


We use the ip2geotools library to geolocate the IPs

In [26]:
# Helper functions that return latitude, longitude, city, region and country for a given IP.
# Any lookup failure (network error, quota, invalid address) yields NaN.
def latitude(ip):
    try:
        g = DbIpCity.get(ip, api_key='free')
        return g.latitude
    except Exception:
        return np.nan

def longitude(ip):
    try:
        g = DbIpCity.get(ip, api_key='free')
        return g.longitude
    except Exception:
        return np.nan

def city(ip):
    try:
        g = DbIpCity.get(ip, api_key='free')
        return g.city
    except Exception:
        return np.nan

def region(ip):
    try:
        g = DbIpCity.get(ip, api_key='free')
        return g.region
    except Exception:
        return np.nan

def country(ip):
    try:
        g = DbIpCity.get(ip, api_key='free')
        return g.country
    except Exception:
        return np.nan
    
latitude('176.31.255.177')

50.6915893

We create the new fields with the geolocation information

In [28]:
IP_procedencia['latitude'] = IP_procedencia['IP'].apply(latitude)
IP_procedencia['longitude'] = IP_procedencia['IP'].apply(longitude)
IP_procedencia['city'] = IP_procedencia['IP'].apply(city)
IP_procedencia['region'] = IP_procedencia['IP'].apply(region)
IP_procedencia['country'] = IP_procedencia['IP'].apply(country)
IP_procedencia

Unnamed: 0,IP,visits,latitude,longitude,city,region,country
0,66.249.76.216,46382,37.389389,-122.083210,Mountain View,California,US
1,80.28.221.123,14725,40.416705,-3.703582,Madrid,Madrid,ES
2,127.0.0.1,13892,36.733438,-119.833235,,,ZZ
3,217.125.71.222,5201,40.416705,-3.703582,Madrid,Madrid,ES
4,66.249.75.148,3558,37.389389,-122.083210,Mountain View,California,US
...,...,...,...,...,...,...,...
77,208.43.225.85,313,38.895037,-77.036543,Washington D.C.,District of Columbia,US
78,83.43.71.59,310,40.416705,-3.703582,Madrid,Madrid,ES
79,85.55.167.1,309,40.434653,-3.814834,Pozuelo de Alarcón,Madrid,ES
80,193.111.141.52,301,,,Amsterdam (Nieuwmarkt en Lastage),North Holland,NL
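
Each of the five helpers queries the service on its own, so every IP is looked up five times. This is a minimal sketch of a single-lookup variant. The `geolocate` helper and the `fake_lookup` stand-in are hypothetical: the stand-in replaces the real service so the example runs offline, and its values are copied from the table above.

```python
import math
from types import SimpleNamespace

FIELDS = ('latitude', 'longitude', 'city', 'region', 'country')

def geolocate(ip, lookup):
    """Return the five geolocation fields for ip using a single lookup call."""
    try:
        result = lookup(ip)
    except Exception:
        # Same fallback as the notebook helpers: NaN when the lookup fails.
        return {field: math.nan for field in FIELDS}
    return {field: getattr(result, field, math.nan) for field in FIELDS}

def fake_lookup(ip):
    # Offline stand-in for the real service (hypothetical, for illustration only).
    if ip == '80.28.221.123':
        return SimpleNamespace(latitude=40.416705, longitude=-3.703582,
                               city='Madrid', region='Madrid', country='ES')
    raise ValueError('address not known to the stand-in')

print(geolocate('80.28.221.123', fake_lookup))
print(geolocate('203.0.113.7', fake_lookup))
```

In the notebook, `lookup` could be `lambda ip: DbIpCity.get(ip, api_key='free')`, and the resulting dictionaries could be expanded into the five columns in one pass.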


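Let's see which countries the visits come from. The ZZ code is the placeholder returned for the loopback address 127.0.0.1 (13892 visits), not a real country.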

In [41]:
IP_procedencia.groupby('country')['visits'].sum().sort_values(ascending=False)

country
US    62500
ES    46937
ZZ    13892
SE     1821
FR     1044
IT      831
DE      552
VG      426
NL      301
Name: visits, dtype: int64
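
The ZZ row comes from 127.0.0.1, a loopback address that has no meaningful location. This is a minimal sketch of how such addresses could be filtered out before geolocating, using the standard `ipaddress` module. The helper name `is_public_ip` is hypothetical, and treating only globally routable addresses as geolocatable is an assumption.

```python
import ipaddress

def is_public_ip(value):
    """Return True only for valid, globally routable IP addresses."""
    try:
        return ipaddress.ip_address(value).is_global
    except ValueError:
        # Not a syntactically valid IPv4/IPv6 address.
        return False

for ip in ['66.249.76.216', '127.0.0.1', '192.168.1.10', 'not-an-ip']:
    print(ip, is_public_ip(ip))
```

Applied to the `IP` column, a filter like this would keep loopback and private addresses out of the per-country totals.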