Import the required libraries

In [1]:
import pandas as pd

from sqlalchemy import create_engine

import sys
sys.path.insert(1, '../../.')
from src import config

Create a connection to the local database

In [2]:
db_connection = create_engine(config.database['db_con_str'])

## Loading and cleaning the project data

In [3]:
projects = pd.read_sql('projects', con=db_connection)
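The loading step can be tried without access to the project database. The following is a minimal sketch that assumes an in-memory SQLite database with a made-up `projects` table; the table contents and the name `demo` are illustrative only, and the real connection string in `config.database['db_con_str']` is not reproduced here.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the real database: an in-memory SQLite DB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER, name TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?)",
    [(1, "akka", "Scala"), (2, "devtools", "R")],
)
conn.commit()

# With a plain DBAPI connection pandas expects a SQL query;
# passing a bare table name requires an SQLAlchemy connectable.
demo = pd.read_sql("SELECT * FROM projects", con=conn)
conn.close()

print(demo.shape)
```

Passing the table name alone, as the notebook does, works because `db_connection` is an SQLAlchemy engine.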

### Dataset overview

In [4]:
projects.shape

(108718, 10)

In [5]:
projects.dtypes

id                      int64
url                    object
owner_id                int64
name                   object
description            object
language               object
created_at     datetime64[ns]
ext_ref_id             object
forked_from           float64
deleted                 int64
dtype: object

In [6]:
projects.describe(include=['object'])

,url,name,description,language,ext_ref_id
count,108718,108718,108718,108616,108718
unique,108710,1205,1239,19,108710
top,https://api.github.com/repos/vjovanov/akka,homebrew,Ruby on Rails,JavaScript,5237c11dbd3543c20b0025d4
freq,2,6784,6434,29526,2


In [7]:
projects.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108718 entries, 0 to 108717
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   id           108718 non-null  int64         
 1   url          108718 non-null  object        
 2   owner_id     108718 non-null  int64         
 3   name         108718 non-null  object        
 4   description  108718 non-null  object        
 5   language     108616 non-null  object        
 6   created_at   108718 non-null  datetime64[ns]
 7   ext_ref_id   108718 non-null  object        
 8   forked_from  108627 non-null  float64       
 9   deleted      108718 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 8.3+ MB


In [8]:
projects.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108718 entries, 0 to 108717
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   id           108718 non-null  int64         
 1   url          108718 non-null  object        
 2   owner_id     108718 non-null  int64         
 3   name         108718 non-null  object        
 4   description  108718 non-null  object        
 5   language     108616 non-null  object        
 6   created_at   108718 non-null  datetime64[ns]
 7   ext_ref_id   108718 non-null  object        
 8   forked_from  108627 non-null  float64       
 9   deleted      108718 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 49.1 MB


In [9]:
projects.memory_usage(deep=True)

Index               128
id               869744
url            11256601
owner_id         869744
name            7065841
description    13224754
language        6789744
created_at       869744
ext_ref_id      8806158
forked_from      869744
deleted          869744
dtype: int64

In [10]:
projects.head()

,id,url,owner_id,name,description,language,created_at,ext_ref_id,forked_from,deleted
0,1,https://api.github.com/repos/akka/akka,1,akka,Akka Project,Scala,2009-02-16 12:51:54,52343e2ebd3543bb7f000002,,0
1,2,https://api.github.com/repos/hadley/devtools,2,devtools,Tools to make an R developer's life easier,R,2010-05-03 04:08:49,52343eecbd35436ddb000002,,0
2,3,https://api.github.com/repos/johnmyleswhite/Pr...,3,ProjectTemplate,A template utility for R projects that provide...,R,2010-08-24 17:22:36,52343eecbd35436de8000002,,0
3,4,https://api.github.com/repos/mavam/stat-cookbook,6,stat-cookbook,The probability and statistics cookbook,R,2012-04-23 20:24:37,52343eecbd35436de3000002,,0
4,5,https://api.github.com/repos/facebook/hiphop-php,8,hiphop-php,"Virtual Machine, Runtime, and JIT for PHP",C++,2010-01-02 01:17:06,52343eecbd35436dee000002,,0


### Removing unnecessary data

In [11]:
projects[projects.deleted != 0].shape

(0, 10)

Since every row has the same value in the `deleted` column (`0`), the column can be dropped without losing any information. I also drop the other columns that are not relevant to the analysis.

In [12]:
projects.drop(['deleted', 'ext_ref_id', 'url', 'owner_id', 'description', 'forked_from'], axis=1, inplace=True)
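The check for a constant column can also be run over all columns at once. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset), using `DataFrame.nunique` to find columns that hold a single distinct value:

```python
import pandas as pd

# Illustrative data only; `deleted` is constant, as in the real table
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "language": ["R", "Scala", "R"],
    "deleted": [0, 0, 0],
})

# Count distinct values per column, treating NaN as a value of its own
n_unique = toy.nunique(dropna=False)

# A column with exactly one distinct value carries no information
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

toy_clean = toy.drop(columns=constant_cols)
print(toy_clean.columns.tolist())
```

Using `drop(columns=...)` returns a new frame instead of modifying in place, which makes it easier to re-run a cell without errors about already-missing columns.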

In [13]:
projects.head()

,id,name,language,created_at
0,1,akka,Scala,2009-02-16 12:51:54
1,2,devtools,R,2010-05-03 04:08:49
2,3,ProjectTemplate,R,2010-08-24 17:22:36
3,4,stat-cookbook,R,2012-04-23 20:24:37
4,5,hiphop-php,C++,2010-01-02 01:17:06


In [14]:
projects.shape

(108718, 4)

Remove rows with missing values, since they cannot support a meaningful analysis

In [15]:
projects.isnull().sum()

id              0
name            0
language      102
created_at      0
dtype: int64

In [16]:
projects.dropna(subset=['language'], how='any', inplace=True)

In [17]:
projects.shape

(108616, 4)

In [18]:
projects.isnull().sum()

id            0
name          0
language      0
created_at    0
dtype: int64
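The effect of the `subset` argument is easiest to see on a small example. The sketch below uses a made-up frame (`toy`, with an extra `forked_from` column kept only for illustration) to show that rows are dropped only when the listed column is missing, while gaps in other columns are left alone:

```python
import numpy as np
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "language": ["R", None, "C++"],
    "forked_from": [np.nan, np.nan, 1.0],
})

# Only a missing `language` disqualifies a row; NaN in `forked_from` is ignored
cleaned = toy.dropna(subset=["language"])

print(cleaned["id"].tolist())
print(int(cleaned["forked_from"].isnull().sum()))
```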

### Changing types
By default, string data is stored as `object`. However, when the number of distinct strings is limited, as in the `language` column (19 unique values), the type can be changed to `category` to reduce memory usage.

In [19]:
projects.dtypes

id                     int64
name                  object
language              object
created_at    datetime64[ns]
dtype: object

In [20]:
projects.memory_usage(deep=True)

Index          868928
id             868928
name          7059027
language      6787296
created_at     868928
dtype: int64

In [21]:
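# A low-cardinality string column is stored more compactly as a categorical
projects.language = projects.language.astype('category')
# created_at is already datetime64[ns]; the cast is kept as an explicit no-op
projects.created_at = projects.created_at.astype('datetime64[ns]')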
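The memory effect of the conversion can be measured directly. This is a minimal sketch on synthetic data (the series `langs` and its four values are assumptions made for illustration, not the notebook's data), comparing `memory_usage(deep=True)` for the `object` and `category` representations:

```python
import pandas as pd

# Synthetic column: 100,000 strings drawn from only four distinct values
langs = pd.Series(["JavaScript", "Ruby", "Python", "Java"] * 25_000)

as_object = langs.astype("object")
as_category = langs.astype("category")

# deep=True also counts the Python string objects behind an object column
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

A categorical column stores each distinct string once and keeps a small integer code per row, so the saving grows with the number of rows and shrinks as the number of distinct values approaches the row count.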

### Exporting the pre-cleaned project data to a `.pkl` file
> NOTE: it is better to save the data to `.pkl` than to `.csv`, because pickling stores the whole structure: type information is not lost and there are no problems with indexes. The file itself also takes up considerably less space.

In [22]:
projects.to_pickle('../../data/01_data_from_db/projects.pkl')
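The difference in type preservation between the two formats can be checked with a round trip. The sketch below uses a made-up two-row frame and a temporary directory (all names here are illustrative assumptions) and compares the dtypes read back from a pickle and from a CSV:

```python
import os
import tempfile

import pandas as pd

# Illustrative frame with a categorical and a datetime column
toy = pd.DataFrame({
    "id": [1, 2],
    "language": pd.Categorical(["Scala", "R"]),
    "created_at": pd.to_datetime(["2009-02-16 12:51:54", "2010-05-03 04:08:49"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps the dtypes; the CSV comes back as plain text columns
print(from_pickle.dtypes.equals(toy.dtypes))
print(from_csv.dtypes)
```

With CSV, the types would have to be restored by hand on every load, for example through the `dtype` and `parse_dates` arguments of `pd.read_csv`.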