---
# Data Science and Artificial Intelligence Practicum
## Module 3.1. Data Wrangling
---

## 3.1.1 - Working with Files

In [29]:
import pandas as pd
import numpy as np

### Reading files

`read_pickle` - Load pickled pandas object (or any object) from file

`read_table` - Read general delimited file into DataFrame

`read_csv` - Read a comma-separated values (csv) file into DataFrame

`read_clipboard` - Read text from clipboard and pass to read_csv

`read_excel` - Read an Excel file into a pandas DataFrame

`read_json` - Convert a JSON string to pandas object

`read_html` - Read HTML tables into a list of DataFrame objects

`read_hdf` - Read a pandas object from an HDF5 store (the store is closed afterwards if `read_hdf` opened it)

`read_sql` - Read SQL query or database table into a DataFrame
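
Several of these readers accept any file-like object, not only a path, which makes them easy to try without files on disk. The sketch below is illustrative only: the tiny in-memory datasets and the variable names are made up for this example and do not come from the practicum datasets.

```python
import io

import pandas as pd

# Hypothetical sample data in three text formats.
csv_text = "city,area\nAndijon,4200\nBuxoro,39400\n"
tsv_text = "city\tarea\nAndijon\t4200\nBuxoro\t39400\n"
json_text = '[{"city": "Andijon", "area": 4200}, {"city": "Buxoro", "area": 39400}]'

# read_csv expects commas by default, read_table expects tabs.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_json parses a list of records into rows.
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_tsv)
print(from_json)
```

All three calls should produce a two-row DataFrame with the columns `city` and `area`.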

#### `names` parameter
List of column names to use. If the file contains a header row, then you should explicitly pass `header=0` to override the column names. Duplicates in this list are not allowed.

In [30]:
pd.read_csv("sample_data/california_housing_test.csv", 
            names=[1,2,3,4,6,7,8,9])

,1,2,3,4,6,7,8,9
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
-122.050000,37.370000,27.000000,3885.000000,661.000000,1537.000000,606.000000,6.608500,344700.000000
-118.300000,34.260000,43.000000,1510.000000,310.000000,809.000000,277.000000,3.599000,176500.000000
-117.810000,33.780000,27.000000,3589.000000,507.000000,1484.000000,495.000000,5.793400,270500.000000
-118.360000,33.820000,28.000000,67.000000,15.000000,49.000000,11.000000,6.135900,330000.000000
...,...,...,...,...,...,...,...,...
-119.860000,34.420000,23.000000,1450.000000,642.000000,1258.000000,607.000000,1.179000,225000.000000
-118.140000,34.060000,27.000000,5257.000000,1082.000000,3496.000000,1036.000000,3.390600,237200.000000
-119.700000,36.300000,10.000000,956.000000,201.000000,693.000000,220.000000,2.289500,62000.000000
-117.120000,34.100000,40.000000,96.000000,14.000000,46.000000,14.000000,3.270800,162500.000000


#### `header` parameter
Row number(s) to use as the column names, and the start of the data. If column names are passed explicitly then the behavior is identical to `header=None`. Explicitly pass `header=0` to be able to replace existing names.

In [31]:
pd.read_csv("sample_data/california_housing_test.csv", 
            names=[1,2,3,4,6,7,8,9], header=0)

,1,2,3,4,6,7,8,9
-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...
-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0
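
The interaction between `names` and `header` is easier to see on a tiny file. This is a minimal sketch on made-up in-memory data (the two-column CSV and the names `lon` and `lat` are assumptions for illustration): without `header=0` the original header row is read as data, and with `header=0` it is replaced.

```python
import io

import pandas as pd

# Hypothetical two-column CSV with its own header row.
csv_text = "longitude,latitude\n-122.05,37.37\n-118.30,34.26\n"

# names only: the header row stays in the data, so every column holds text.
as_data = pd.read_csv(io.StringIO(csv_text), names=["lon", "lat"])

# names + header=0: the header row is consumed and replaced by the new names.
replaced = pd.read_csv(io.StringIO(csv_text), names=["lon", "lat"], header=0)

print(as_data)
print(replaced)
print(replaced.dtypes)
```

In the first result the values are strings because the text `longitude` shares a column with the numbers. In the second result the columns are parsed as floats.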


#### `index_col` parameter
Column(s) to use as the row labels of the DataFrame, either given as string name or column index.

In [32]:
# names= is passed without header=0, so the file's own header row is kept as the first data row
df = pd.read_csv("https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/uzbekistan.csv",
            names=["hudud", "maydon", "aholi"], index_col=0)
df

hudud,maydon,aholi
Boʻlinishi,Maydoni (kv.km),Aholisi
Andijon viloyati,4200.00,1899000.00
Buxoro viloyati,39400.00,1384700.00
Fargʻona viloyati,6800.00,2597000.00
Jizzax viloyati,20500.00,910500.00
Xorazm viloyati,6300.00,1200000.00
Namangan viloyati,7900.00,1862000.00
Navoiy viloyati,110800.00,767500.00
Qashqadaryo viloyati,28400.00,2029000.00
Qoraqalpogʻiston Respublikasi,160000.00,1200000.00


In [33]:
pd.read_csv("https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/uzbekistan.csv",
            index_col="Aholisi")

Aholisi,Boʻlinishi,Maydoni (kv.km)
1899000.0,Andijon viloyati,4200.0
1384700.0,Buxoro viloyati,39400.0
2597000.0,Fargʻona viloyati,6800.0
910500.0,Jizzax viloyati,20500.0
1200000.0,Xorazm viloyati,6300.0
1862000.0,Namangan viloyati,7900.0
767500.0,Navoiy viloyati,110800.0
2029000.0,Qashqadaryo viloyati,28400.0
1200000.0,Qoraqalpogʻiston Respublikasi,160000.0
2322000.0,Samarqand viloyati,16400.0
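
As a small self-contained illustration, `index_col` can be given either as a position or as a column name. The in-memory data and the column names below are hypothetical and chosen only for this sketch.

```python
import io

import pandas as pd

# Hypothetical CSV with a header row.
csv_text = "region,area,population\nAndijon,4200,1899000\nBuxoro,39400,1384700\n"

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by its name in the header row.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="region")

print(by_position)
print(by_position.loc["Buxoro", "area"])
```

Both calls give the same DataFrame here, with `region` as the row labels, so rows can be looked up with `.loc` by region name.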


#### `sep` parameter
Delimiter to use. Defaults to `","`.

In [34]:
pd.read_csv('https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/automobile_data.csv',
            sep=",").head()

,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0
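
The effect of `sep` is clearer with a file that does not use commas. The sketch below reads a made-up semicolon-delimited text (the data is an assumption for illustration) with the default delimiter and then with `sep=";"`.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data.
text = "company;price\naudi;13950.0\nvolvo;12940.0\n"

# With the default sep=",", each line is read as one single field.
wrong = pd.read_csv(io.StringIO(text))

# With sep=";" the two columns are split as intended.
right = pd.read_csv(io.StringIO(text), sep=";")

print(wrong.shape)
print(right.shape)
print(right)
```

With the default delimiter the result has a single column named `company;price`, which is a common sign that `sep` needs to be set.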


#### `nrows` parameter
Number of rows of file to read. Useful for reading pieces of large files.

In [35]:
pd.read_csv('https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/global_earthquake.csv',
            index_col=0, nrows=15)

Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
01/02/1965,13:44:18,19.246,145.616,Earthquake,131.6,,,6.0,MW,,,,,,,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
01/04/1965,11:29:49,1.863,127.352,Earthquake,80.0,,,5.8,MW,,,,,,,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
01/05/1965,18:05:58,-20.579,-173.972,Earthquake,20.0,,,6.2,MW,,,,,,,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
01/08/1965,18:49:43,-59.076,-23.557,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
01/09/1965,13:32:50,11.938,126.427,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic
01/10/1965,13:36:32,-13.405,166.629,Earthquake,35.0,,,6.7,MW,,,,,,,ISCGEM860922,ISCGEM,ISCGEM,ISCGEM,Automatic
01/12/1965,13:32:25,27.357,87.867,Earthquake,20.0,,,5.9,MW,,,,,,,ISCGEM861007,ISCGEM,ISCGEM,ISCGEM,Automatic
01/15/1965,23:17:42,-13.309,166.212,Earthquake,35.0,,,6.0,MW,,,,,,,ISCGEM861111,ISCGEM,ISCGEM,ISCGEM,Automatic
01/16/1965,11:32:37,-56.452,-27.043,Earthquake,95.0,,,6.0,MW,,,,,,,ISCGEMSUP861125,ISCGEMSUP,ISCGEM,ISCGEM,Automatic
01/17/1965,10:43:17,-24.563,178.487,Earthquake,565.0,,,5.8,MW,,,,,,,ISCGEM861148,ISCGEM,ISCGEM,ISCGEM,Automatic
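
A quick way to check how `nrows` behaves is to generate a longer table in memory and read only its beginning. The generated data below is made up for this sketch.

```python
import io

import pandas as pd

# Build a hypothetical 1000-row CSV in memory.
rows = "\n".join(f"{i},{i * i}" for i in range(1000))
text = "n,square\n" + rows + "\n"

# Only the first 5 data rows are parsed; the header row is not counted.
head = pd.read_csv(io.StringIO(text), nrows=5)

print(head)
```

The header line does not count towards `nrows`, so the result has exactly five data rows.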


### Writing files

In [36]:
df

hudud,maydon,aholi
Boʻlinishi,Maydoni (kv.km),Aholisi
Andijon viloyati,4200.00,1899000.00
Buxoro viloyati,39400.00,1384700.00
Fargʻona viloyati,6800.00,2597000.00
Jizzax viloyati,20500.00,910500.00
Xorazm viloyati,6300.00,1200000.00
Namangan viloyati,7900.00,1862000.00
Navoiy viloyati,110800.00,767500.00
Qashqadaryo viloyati,28400.00,2029000.00
Qoraqalpogʻiston Respublikasi,160000.00,1200000.00


In [37]:
# write object to a comma-separated values (csv) file
df.to_csv("uzb_data.csv")
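
To see what `to_csv` writes without creating a file, it can be called with no path, in which case it returns the CSV text as a string. The sketch below uses a small made-up DataFrame (not the `df` above) and reads the text back with `read_csv`.

```python
import io

import pandas as pd

# Hypothetical frame with a named index.
frame = pd.DataFrame(
    {"area": [4200.0, 39400.0]},
    index=pd.Index(["Andijon", "Buxoro"], name="region"),
)

# Without a path, to_csv returns the CSV content as a string.
csv_text = frame.to_csv()
print(csv_text)

# The index is written as the first column, so index_col=0 restores it.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Because the index is written as the first column by default, passing `index_col=0` when reading recovers the original row labels.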

In [38]:
ser_obj = df.maydon
ser_obj

hudud
Boʻlinishi                       Maydoni (kv.km)
Andijon viloyati                         4200.00
Buxoro viloyati                         39400.00
Fargʻona viloyati                        6800.00
Jizzax viloyati                         20500.00
Xorazm viloyati                          6300.00
Namangan viloyati                        7900.00
Navoiy viloyati                        110800.00
Qashqadaryo viloyati                    28400.00
Qoraqalpogʻiston Respublikasi          160000.00
Samarqand viloyati                      16400.00
Sirdaryo viloyati                        5100.00
Surxondaryo viloyati                    20800.00
Toshkent viloyati                       15300.00
Name: maydon, dtype: object

In [39]:
ser_obj.to_csv("uzb_maydon.csv")

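#### `squeeze`
If the parsed data contains only one column, it can be returned as a Series. The `squeeze=True` argument of `read_csv` was deprecated in pandas 1.4 and removed in 2.0; call `.squeeze("columns")` on the result instead.

In [40]:
pd.read_csv("uzb_maydon.csv", index_col=0).squeeze("columns")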

hudud
Boʻlinishi                       Maydoni (kv.km)
Andijon viloyati                         4200.00
Buxoro viloyati                         39400.00
Fargʻona viloyati                        6800.00
Jizzax viloyati                         20500.00
Xorazm viloyati                          6300.00
Namangan viloyati                        7900.00
Navoiy viloyati                        110800.00
Qashqadaryo viloyati                    28400.00
Qoraqalpogʻiston Respublikasi          160000.00
Samarqand viloyati                      16400.00
Sirdaryo viloyati                        5100.00
Surxondaryo viloyati                    20800.00
Toshkent viloyati                       15300.00
Name: maydon, dtype: object

In [41]:
df2 = pd.read_csv('https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/automobile_data.csv',
                  index_col=0)
df2

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0
...,...,...,...,...,...,...,...,...,...
81,volkswagen,sedan,97.3,171.7,ohc,four,85,27,7975.0
82,volkswagen,sedan,97.3,171.7,ohc,four,52,37,7995.0
86,volkswagen,sedan,97.3,171.7,ohc,four,100,26,9995.0
87,volvo,sedan,104.3,188.8,ohc,four,114,23,12940.0


#### Dropping the index
Pass `index=False` to leave the row index out of the written file.

In [42]:
df2.to_excel("auto_data.xlsx", index=False)

In [43]:
pd.read_excel("auto_data.xlsx")

,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0
...,...,...,...,...,...,...,...,...,...
56,volkswagen,sedan,97.3,171.7,ohc,four,85,27,7975.0
57,volkswagen,sedan,97.3,171.7,ohc,four,52,37,7995.0
58,volkswagen,sedan,97.3,171.7,ohc,four,100,26,9995.0
59,volvo,sedan,104.3,188.8,ohc,four,114,23,12940.0


#### Writing only certain columns
The `columns` parameter selects which columns are written.

In [44]:
df2.to_excel("auto_data.xlsx",
             index=False,
             columns=['company','body-style','engine-type','horsepower','price']
             )

In [45]:
pd.read_excel("auto_data.xlsx")

,company,body-style,engine-type,horsepower,price
0,alfa-romero,convertible,dohc,111,13495.0
1,alfa-romero,convertible,dohc,111,16500.0
2,alfa-romero,hatchback,ohcv,154,16500.0
3,audi,sedan,ohc,102,13950.0
4,audi,sedan,ohc,115,17450.0
...,...,...,...,...,...
56,volkswagen,sedan,ohc,85,7975.0
57,volkswagen,sedan,ohc,52,7995.0
58,volkswagen,sedan,ohc,100,9995.0
59,volvo,sedan,ohc,114,12940.0
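
The `index` and `columns` parameters are also available on `to_csv`, so the same idea can be tried without an Excel engine installed. This is a sketch on a small made-up DataFrame. Using `to_csv` here instead of `to_excel` is a substitution made only to keep the example dependency-free.

```python
import io

import pandas as pd

# Hypothetical frame with a non-default index.
frame = pd.DataFrame(
    {
        "company": ["audi", "volvo"],
        "length": [176.6, 188.8],
        "price": [13950.0, 12940.0],
    },
    index=[3, 87],
)

# Write only two columns and leave the index out.
csv_text = frame.to_csv(index=False, columns=["company", "price"])

# Reading back yields a fresh 0..n-1 index and only the selected columns.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The original index values (3 and 87) are not stored, so the restored frame is numbered from 0, which mirrors the renumbered rows in the Excel output above.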


### HDF5 format
Hierarchical Data Format (HDF)

#### `pandas.HDFStore`

In [46]:
hdfobj = pd.HDFStore("data.h5")

Add data to the HDF store by assigning it to a key:

In [47]:
hdfobj['uzdata'] = df

In [48]:
hdfobj['cardata'] = df2

Stored objects can be retrieved by key, as with a dictionary:

In [49]:
hdfobj['uzdata']

hudud,maydon,aholi
Boʻlinishi,Maydoni (kv.km),Aholisi
Andijon viloyati,4200.00,1899000.00
Buxoro viloyati,39400.00,1384700.00
Fargʻona viloyati,6800.00,2597000.00
Jizzax viloyati,20500.00,910500.00
Xorazm viloyati,6300.00,1200000.00
Namangan viloyati,7900.00,1862000.00
Navoiy viloyati,110800.00,767500.00
Qashqadaryo viloyati,28400.00,2029000.00
Qoraqalpogʻiston Respublikasi,160000.00,1200000.00


After saving the data, the store should be closed with `close()`:

In [50]:
hdfobj.close()
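
An alternative to calling `close()` by hand is to open the store in a `with` block, which closes it on exit even if an error occurs. The sketch below assumes the optional PyTables package (`tables`) is installed and skips the HDF5 part otherwise. The file name `demo.h5`, the key `regions` and the sample data are made up for illustration.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
frame = pd.DataFrame({"area": [4200.0, 39400.0]}, index=["Andijon", "Buxoro"])

# pandas needs the optional PyTables package for HDF5 support.
have_tables = importlib.util.find_spec("tables") is not None
restored = None
keys = []

if have_tables:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")
        # The store is closed automatically when the with block ends.
        with pd.HDFStore(path) as store:
            store["regions"] = frame
            keys = store.keys()
        restored = pd.read_hdf(path, key="regions")
    print(keys)
    print(restored)
else:
    print("PyTables is not installed; skipping the HDF5 example.")
```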

#### `DataFrame.to_hdf` method
Write the contained data to an HDF5 file using `HDFStore`.

In [51]:
df.to_hdf("uzdata.h5", key='uzdata')

To read HDF5 files, we can use either `HDFStore` or the `read_hdf` function.

In [52]:
data = pd.HDFStore("data.h5", mode='r')

In [53]:
# return a list of keys corresponding to objects stored in HDFStore
data.keys()

['/cardata', '/uzdata']

In [54]:
data['uzdata'].head()

hudud,maydon,aholi
Boʻlinishi,Maydoni (kv.km),Aholisi
Andijon viloyati,4200.00,1899000.00
Buxoro viloyati,39400.00,1384700.00
Fargʻona viloyati,6800.00,2597000.00
Jizzax viloyati,20500.00,910500.00


In [55]:
data.close()

In [56]:
pd.read_hdf('uzdata.h5', key='uzdata')

hudud,maydon,aholi
Boʻlinishi,Maydoni (kv.km),Aholisi
Andijon viloyati,4200.00,1899000.00
Buxoro viloyati,39400.00,1384700.00
Fargʻona viloyati,6800.00,2597000.00
Jizzax viloyati,20500.00,910500.00
Xorazm viloyati,6300.00,1200000.00
Namangan viloyati,7900.00,1862000.00
Navoiy viloyati,110800.00,767500.00
Qashqadaryo viloyati,28400.00,2029000.00
Qoraqalpogʻiston Respublikasi,160000.00,1200000.00
