# Data Analysis Project: [Input Dataset Name]
- **Name:** Refanda Surya Saputra
- **Email:** refandasuryasaputra@gmail.com
- **Dicoding ID:** refan_surya

## Defining the Business Questions

- How did bike rentals trend from spring through winter in 2011?
- What are the demographics of bike renters?

## Import All Packages/Libraries Used

In [242]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Wrangling

### Gathering Data

In [243]:
day_df = pd.read_csv("data/day.csv")
day_df.head()

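| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |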


In [244]:
hour_df = pd.read_csv("data/hour.csv")
hour_df.head()

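| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |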


**Insight:**
- There are two datasets recording the number of bikes rented per hour and per day from 2011 to 2012, along with additional information on weather conditions and season
- The daily dataset has no hr (hour) column, while the hourly dataset does. The daily dataset records each day's accumulated totals.
- Summary of the features in both datasets:
    - instant: record index
    - dteday: date
    - season: season (1: spring, 2: summer, 3: fall, 4: winter)
    - yr: year (0: 2011, 1: 2012)
    - mnth: month (1 to 12)
    - holiday: whether the day is a holiday
    - weekday: day of the week
    - workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
    - weathersit: weather condition
        - 1: clear, few clouds, partly cloudy
        - 2: mist + cloudy, mist + mostly cloudy, mist + few clouds, mist
        - 3: light snow, light rain + thunderstorm + partly clear, light rain + partly clear
        - 4: heavy rain + ice + thunderstorm + mist, snow + fog
    - temp: normalized temperature in Celsius
    - atemp: normalized feels-like temperature in Celsius
    - hum: normalized humidity
    - windspeed: normalized wind speed
    - casual: number of casual users
    - registered: number of registered users
    - cnt: total number of rentals, casual and registered combined
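
The integer codes in the feature summary can be decoded into readable labels before grouping or plotting. The following is a minimal sketch, assuming a small hand-built frame in place of `day.csv`; the names `SEASON_LABELS`, `YEAR_LABELS`, and `decode_codes` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Label mappings taken from the feature summary above.
SEASON_LABELS = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
YEAR_LABELS = {0: 2011, 1: 2012}


def decode_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable season and year columns added."""
    out = df.copy()
    out["season_name"] = out["season"].map(SEASON_LABELS)
    out["year"] = out["yr"].map(YEAR_LABELS)
    return out


# Tiny hypothetical sample standing in for the real dataset.
sample = pd.DataFrame({"season": [1, 3, 4], "yr": [0, 0, 1]})
decoded = decode_codes(sample)
print(decoded)
```

Keeping the original coded columns and adding new ones leaves the encoded values available for any later modelling step.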


### Assessing Data

**Assessing the Daily Rental Data**

In [245]:
day_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


In [246]:
day_df.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [247]:
print(f"Number of duplicates: {day_df.duplicated().sum()}")

Number of duplicates: 0


In [248]:
day_df.describe()

| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 366.0 | 2.49658 | 0.500684 | 6.519836 | 0.028728 | 2.997264 | 0.683995 | 1.395349 | 0.495385 | 0.474354 | 0.627894 | 0.190486 | 848.176471 | 3656.172367 | 4504.348837 |
| std | 211.165812 | 1.110807 | 0.500342 | 3.451913 | 0.167155 | 2.004787 | 0.465233 | 0.544894 | 0.183051 | 0.162961 | 0.142429 | 0.077498 | 686.622488 | 1560.256377 | 1937.211452 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.05913 | 0.07907 | 0.0 | 0.022392 | 2.0 | 20.0 | 22.0 |
| 25% | 183.5 | 2.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.337083 | 0.337842 | 0.52 | 0.13495 | 315.5 | 2497.0 | 3152.0 |
| 50% | 366.0 | 3.0 | 1.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.498333 | 0.486733 | 0.626667 | 0.180975 | 713.0 | 3662.0 | 4548.0 |
| 75% | 548.5 | 3.0 | 1.0 | 10.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.655417 | 0.608602 | 0.730209 | 0.233214 | 1096.0 | 4776.5 | 5956.0 |
| max | 731.0 | 4.0 | 1.0 | 12.0 | 1.0 | 6.0 | 1.0 | 3.0 | 0.861667 | 0.840896 | 0.9725 | 0.507463 | 3410.0 | 6946.0 | 8714.0 |


**Assessing the Hourly Bike Rental Data**

In [249]:
hour_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


In [250]:
hour_df.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [251]:
print(f"Number of duplicates: {hour_df.duplicated().sum()}")

Number of duplicates: 0


In [252]:
hour_df.describe()

| | instant | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 8690.0 | 2.50164 | 0.502561 | 6.537775 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 |
| std | 5017.0295 | 1.106918 | 0.500008 | 3.438776 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 4345.5 | 2.0 | 0.0 | 4.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 |
| 50% | 8690.0 | 3.0 | 1.0 | 7.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 |
| 75% | 13034.5 | 3.0 | 1.0 | 10.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 |
| max | 17379.0 | 4.0 | 1.0 | 12.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 |


The following table summarizes the assessment of the data to be used:

| | Data Type | Missing Values | Duplicated Data | Inaccurate Values |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| day_df | The dteday column has the wrong data type | - | - | - |
| hour_df | The dteday column has the wrong data type | - | - | - |

**Insight:**
- Neither hour_df nor day_df contains duplicated rows
- Neither hour_df nor day_df contains missing values
- In both hour_df and day_df the dteday column has the wrong data type: it is stored as object but should be datetime
- In both hour_df and day_df the summary statistics show no obviously inaccurate values
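
The three checks above (missing values, duplicates, and date columns with the wrong type) can be bundled so that both datasets are assessed the same way. This is a sketch only: the helper name `assess`, its `date_columns` parameter, and the two-row sample are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def assess(df: pd.DataFrame, date_columns=("dteday",)) -> dict:
    """Summarize the checks used in this section for one DataFrame."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
        # Date columns that have not yet been converted to a datetime dtype.
        "unparsed_date_columns": [
            col
            for col in date_columns
            if not pd.api.types.is_datetime64_any_dtype(df[col])
        ],
    }


# Hypothetical two-row sample using the dteday/cnt layout of day.csv.
sample = pd.DataFrame({"dteday": ["2011-01-01", "2011-01-02"], "cnt": [985, 801]})
report = assess(sample)
print(report)
```

Calling the same helper on `day_df` and `hour_df` would give one comparable report per dataset, matching the rows of the summary table.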

### Cleaning Data

**Cleaning the Daily Bike Rental Data**

In [253]:
day_df["dteday"] = pd.to_datetime(day_df["dteday"])

In [254]:
day_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     731 non-null    int64         
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    int64         
 3   yr          731 non-null    int64         
 4   mnth        731 non-null    int64         
 5   holiday     731 non-null    int64         
 6   weekday     731 non-null    int64         
 7   workingday  731 non-null    int64         
 8   weathersit  731 non-null    int64         
 9   temp        731 non-null    float64       
 10  atemp       731 non-null    float64       
 11  hum         731 non-null    float64       
 12  windspeed   731 non-null    float64       
 13  casual      731 non-null    int64         
 14  registered  731 non-null    int64         
 15  cnt         731 non-null    int64         
dtypes: datetime64[ns](1), floa

**Cleaning the Hourly Bike Rental Data**

In [255]:
hour_df["dteday"] = pd.to_datetime(hour_df["dteday"])

In [256]:
hour_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     17379 non-null  int64         
 1   dteday      17379 non-null  datetime64[ns]
 2   season      17379 non-null  int64         
 3   yr          17379 non-null  int64         
 4   mnth        17379 non-null  int64         
 5   hr          17379 non-null  int64         
 6   holiday     17379 non-null  int64         
 7   weekday     17379 non-null  int64         
 8   workingday  17379 non-null  int64         
 9   weathersit  17379 non-null  int64         
 10  temp        17379 non-null  float64       
 11  atemp       17379 non-null  float64       
 12  hum         17379 non-null  float64       
 13  windspeed   17379 non-null  float64       
 14  casual      17379 non-null  int64         
 15  registered  17379 non-null  int64         
 16  cnt         17379 non-

**Insight:**
- For day_df and hour_df there is little to clean; the only change is converting the dteday column from object to datetime
- The dataset is already well suited to model training: it has no duplicated data, missing values, or inaccurate values, the numeric features are already normalized, and categorical features such as season and weathersit are already encoded
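
An alternative to converting `dteday` after loading is to parse it while reading the file, using the `parse_dates` argument of `pd.read_csv`. The sketch below assumes an in-memory CSV with two rows copied from the daily preview and trimmed to three columns; the real notebook reads `data/day.csv` instead.

```python
import io

import pandas as pd

# Two rows from the day.csv preview, reduced to three columns for brevity.
csv_text = "instant,dteday,cnt\n1,2011-01-01,985\n2,2011-01-02,801\n"

# parse_dates converts dteday during the read, so no later to_datetime call is needed.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["dteday"])
print(df.dtypes)
```

Either approach yields a datetime column; parsing at read time simply keeps the loading and cleaning steps in one place.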

## Exploratory Data Analysis (EDA)

### Explore ...

In [257]:
day_df.describe(include="all")

| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 731.0 | 731 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 366.0 | 2012-01-01 00:00:00 | 2.49658 | 0.500684 | 6.519836 | 0.028728 | 2.997264 | 0.683995 | 1.395349 | 0.495385 | 0.474354 | 0.627894 | 0.190486 | 848.176471 | 3656.172367 | 4504.348837 |
| min | 1.0 | 2011-01-01 00:00:00 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.05913 | 0.07907 | 0.0 | 0.022392 | 2.0 | 20.0 | 22.0 |
| 25% | 183.5 | 2011-07-02 12:00:00 | 2.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.337083 | 0.337842 | 0.52 | 0.13495 | 315.5 | 2497.0 | 3152.0 |
| 50% | 366.0 | 2012-01-01 00:00:00 | 3.0 | 1.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.498333 | 0.486733 | 0.626667 | 0.180975 | 713.0 | 3662.0 | 4548.0 |
| 75% | 548.5 | 2012-07-01 12:00:00 | 3.0 | 1.0 | 10.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.655417 | 0.608602 | 0.730209 | 0.233214 | 1096.0 | 4776.5 | 5956.0 |
| max | 731.0 | 2012-12-31 00:00:00 | 4.0 | 1.0 | 12.0 | 1.0 | 6.0 | 1.0 | 3.0 | 0.861667 | 0.840896 | 0.9725 | 0.507463 | 3410.0 | 6946.0 | 8714.0 |
| std | 211.165812 | NaN | 1.110807 | 0.500342 | 3.451913 | 0.167155 | 2.004787 | 0.465233 | 0.544894 | 0.183051 | 0.162961 | 0.142429 | 0.077498 | 686.622488 | 1560.256377 | 1937.211452 |


In [258]:
hour_df.describe(include="all")

| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 17379.0 | 17379 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 8690.0 | 2012-01-02 04:08:34.552045568 | 2.50164 | 0.502561 | 6.537775 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 |
| min | 1.0 | 2011-01-01 00:00:00 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 4345.5 | 2011-07-04 00:00:00 | 2.0 | 0.0 | 4.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 |
| 50% | 8690.0 | 2012-01-02 00:00:00 | 3.0 | 1.0 | 7.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 |
| 75% | 13034.5 | 2012-07-02 00:00:00 | 3.0 | 1.0 | 10.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 |
| max | 17379.0 | 2012-12-31 00:00:00 | 4.0 | 1.0 | 12.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 |
| std | 5017.0295 | NaN | 1.106918 | 0.500008 | 3.438776 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 |


In [259]:
day_df.groupby(by="season").agg({
    "cnt": "sum"
})

| season | cnt |
| --- | --- |
| 1 | 471348 |
| 2 | 918589 |
| 3 | 1061129 |
| 4 | 841613 |


In [260]:
day_df.groupby(by=["yr", "season"])["cnt"].sum()

yr  season
0   1         150000
    2         347316
    3         419650
    4         326137
1   1         321348
    2         571273
    3         641479
    4         515476
Name: cnt, dtype: int64

In [261]:
day_df.groupby(by=["yr", "mnth"])["cnt"].sum()

yr  mnth
0   1        38189
    2        48215
    3        64045
    4        94870
    5       135821
    6       143512
    7       141341
    8       136691
    9       127418
    10      123511
    11      102167
    12       87323
1   1        96744
    2       103137
    3       164875
    4       174224
    5       195865
    6       202830
    7       203607
    8       214503
    9       218573
    10      198841
    11      152664
    12      123713
Name: cnt, dtype: int64

In [262]:
workday_df = day_df.groupby(by="workingday")["cnt"].sum()
print(workday_df)

workingday
0    1000269
1    2292410
Name: cnt, dtype: int64


In [263]:
holiday_df = day_df.groupby(by="holiday")["cnt"].sum()
print(holiday_df)

holiday
0    3214244
1      78435
Name: cnt, dtype: int64


In [264]:
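# Non-working days (workingday == 0) are weekends plus holidays,
# so subtracting the holiday total leaves the weekend total.
weekend_count = workday_df.loc[0] - holiday_df.loc[1]
print(f"Total bike rentals on weekends: {weekend_count}")

Total bike rentals on weekends: 921834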
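
The subtraction above relies on the definition of `workingday` (0 for weekends and holidays). The same figure can be cross-checked by filtering rows directly. The sketch below uses a hypothetical four-day sample and assumes the weekday coding 0 = Sunday and 6 = Saturday, which is inferred from the preview where 2011-01-01, a Saturday, has `weekday` 6.

```python
import pandas as pd

# Hypothetical sample: a Saturday, a Sunday, a holiday Monday, and a regular Tuesday.
sample = pd.DataFrame({
    "weekday": [6, 0, 1, 2],
    "holiday": [0, 0, 1, 0],
    "workingday": [0, 0, 0, 1],
    "cnt": [100, 80, 50, 200],
})

# Approach used in the notebook: non-working-day total minus holiday total.
by_subtraction = (
    sample.loc[sample["workingday"] == 0, "cnt"].sum()
    - sample.loc[sample["holiday"] == 1, "cnt"].sum()
)

# Direct approach: keep Saturdays (6) and Sundays (0) that are not holidays.
is_weekend = sample["weekday"].isin([0, 6]) & (sample["holiday"] == 0)
by_filter = sample.loc[is_weekend, "cnt"].sum()

print(by_subtraction, by_filter)
```

The direct filter states the intent more explicitly and keeps a boolean mask that can be reused for other weekend-only aggregations.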


In [274]:
day_df.groupby(by="weathersit")["cnt"].sum()

weathersit
1    2257952
2     996858
3      37869
Name: cnt, dtype: int64

In [266]:
hour_df.groupby(by="weathersit")["cnt"].sum()

weathersit
1    2338173
2     795952
3     158331
4        223
Name: cnt, dtype: int64

In [267]:
day_df.groupby(by="yr")["casual"].sum()

yr
0    247252
1    372765
Name: casual, dtype: int64

In [268]:
day_df.groupby(by="yr")["registered"].sum()

yr
0     995851
1    1676811
Name: registered, dtype: int64

In [269]:
casual_df = day_df.groupby(by=["yr", "season"])["casual"].sum()
print(casual_df)

yr  season
0   1          21425
    2          77564
    3          95450
    4          52813
1   1          39197
    2         125958
    3         130641
    4          76969
Name: casual, dtype: int64


In [270]:
registered_df = day_df.groupby(by=["yr", "season"])["registered"].sum()
print(registered_df)

yr  season
0   1         128575
    2         269752
    3         324200
    4         273324
1   1         282151
    2         445315
    3         510838
    4         438507
Name: registered, dtype: int64


In [271]:
users_df = day_df.groupby(by=["yr", "season"]).agg({
    "casual": "sum",
    "registered": "sum"
})

user_count_per_season = users_df["casual"] + users_df["registered"]

users_df["count"] = user_count_per_season

print(users_df)

           casual  registered   count
yr season                            
0  1        21425      128575  150000
   2        77564      269752  347316
   3        95450      324200  419650
   4        52813      273324  326137
1  1        39197      282151  321348
   2       125958      445315  571273
   3       130641      510838  641479
   4        76969      438507  515476
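
For the question about renter demographics, absolute totals can be turned into shares per user type. This is a minimal sketch that rebuilds the 2011 rows of the `users_df` output by hand; the variable names `users_2011`, `totals`, and `shares` are illustrative and not part of the original notebook.

```python
import pandas as pd

# 2011 seasonal totals copied from the users_df output above.
users_2011 = pd.DataFrame(
    {
        "casual": [21425, 77564, 95450, 52813],
        "registered": [128575, 269752, 324200, 273324],
    },
    index=pd.Index([1, 2, 3, 4], name="season"),
)

# Divide each row by its row total to get the share of each user type.
totals = users_2011.sum(axis=1)
shares = users_2011.div(totals, axis=0)

print(shares.round(3))
```

Shares make seasons with very different rental volumes directly comparable, which raw sums do not.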


In [272]:
weekday_df = day_df.groupby(by="weekday")["cnt"].sum()
print(weekday_df)

weekday
0    444027
1    455503
2    469109
3    473048
4    485395
5    487790
6    477807
Name: cnt, dtype: int64


In [273]:
day_df.groupby(by=["yr", "mnth", "weekday"])["cnt"].sum()

yr  mnth  weekday
0   1     0           4909
          1           6587
          2           5493
          3           4918
          4           5370
                     ...  
1   12    2          18677
          3          16756
          4          17149
          5          17337
          6          18910
Name: cnt, Length: 168, dtype: int64

**Insight:**
- xxx
- xxx

## Visualization & Explanatory Analysis

### Question 1:
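
One possible way to visualize the 2011 trend is a line chart of monthly totals. The sketch below is only a starting point: it hard-codes the 2011 values from the `groupby(by=["yr", "mnth"])` output in the EDA section rather than reading `day_df`, and the figure size, labels, and title are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the sketch runs without a display.
import matplotlib.pyplot as plt
import pandas as pd

# 2011 monthly totals copied from the EDA groupby output.
monthly_2011 = pd.Series(
    [38189, 48215, 64045, 94870, 135821, 143512,
     141341, 136691, 127418, 123511, 102167, 87323],
    index=pd.RangeIndex(1, 13, name="mnth"),
    name="cnt",
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(monthly_2011.index, monthly_2011.values, marker="o")
ax.set_xticks(list(monthly_2011.index))
ax.set_xlabel("Month (2011)")
ax.set_ylabel("Total rentals")
ax.set_title("Monthly bike rentals in 2011")
fig.tight_layout()
```

In the notebook itself, the series could instead be taken from `day_df` filtered to `yr == 0`, so the chart stays in sync with the data.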

### Question 2:

**Insight:**
- xxx
- xxx

## Advanced Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2