# Statistical Thinking Session Assignment
---
### Cohort 9, 김서진

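- Idea 1. Split the stations into roads, residential areas, and mountains and re-run the analysis, then use a t-test or analysis of variance (ANOVA) to check whether the regions differ significantly.
    - If there is a significant difference, the regions with significantly lower values can be selected and proposed as sites for new parks.
    - Quantities to test: deviation and mean concentration. The trend over time is probably an important factor as well.

- Using the full map data, select three regions with low deviation, then assess whether each is suitable as a park site based on its change over time and its average concentration.

- Idea 2. Derive the correlation between the newly found park data and the annual data.
    - Use a heatmap to find the correlation (corr), then take the pollution concentration data for the 5 districts with the highest park ratio and the 5 with the lowest, and check whether the difference between the two groups is significant.

- This ipynb file carries out the analysis for Idea 2.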

In [1]:
import numpy as np
import pandas as pd

In [2]:
import chardet
year_cont = '/Users/gimseojin/Desktop/2023-1/DSL/EDA_project/EDA_data/서울시 년도별 평균 대기오염도 정보.csv'
with open(year_cont, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

{'encoding': 'EUC-KR', 'confidence': 0.99, 'language': 'Korean'}

In [3]:
year_cont_df = pd.read_csv(year_cont, encoding='EUC-KR')  # reuse the path and the encoding detected above
year_cont_df.head()

Unnamed: 0,측정년도,측정소명,이산화질소농도(ppm),오존농도(ppm),일산화탄소농도(ppm),아황산가스(ppm),미세먼지(㎍/㎥),초미세먼지(㎍/㎥)
0,2023,강남구,0.036,0.015,0.6,0.004,44.0,29.0
1,2023,강남대로,0.04,0.01,0.9,0.004,48.0,32.0
2,2023,강동구,0.031,0.012,0.6,0.003,48.0,34.0
3,2023,강변북로,0.043,0.01,0.8,0.003,47.0,33.0
4,2023,강북구,0.03,0.018,0.7,0.003,44.0,28.0


In [4]:
# Rename the columns for convenience
year_cont_df.rename(columns={"측정년도":"Year", "측정소명":"Where","이산화질소농도(ppm)":"NO2", "오존농도(ppm)":"O3","일산화탄소농도(ppm)":"CO",
                            "아황산가스(ppm)":"SO2", "미세먼지(㎍/㎥)":"Dust", "초미세먼지(㎍/㎥)":"F_dust"}, inplace=True)
year_cont_df

Unnamed: 0,Year,Where,NO2,O3,CO,SO2,Dust,F_dust
0,2023,강남구,0.036,0.015,0.6,0.004,44.0,29.0
1,2023,강남대로,0.040,0.010,0.9,0.004,48.0,32.0
2,2023,강동구,0.031,0.012,0.6,0.003,48.0,34.0
3,2023,강변북로,0.043,0.010,0.8,0.003,47.0,33.0
4,2023,강북구,0.030,0.018,0.7,0.003,44.0,28.0
...,...,...,...,...,...,...,...,...
1223,1987,서초구,0.037,0.017,2.9,0.064,,
1224,1987,성동구,0.035,0.022,3.8,0.094,,
1225,1987,송파구,0.032,0.014,1.8,,,
1226,1987,송파구2,0.048,0.005,2.0,0.040,,


In [5]:
# Fill missing values in the gas columns with each column's overall mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
gas_cols = ['NO2', 'O3', 'CO', 'SO2']
year_cont_df[gas_cols] = year_cont_df[gas_cols].fillna(year_cont_df[gas_cols].mean())
year_cont_df.isnull().sum()

Year        0
Where       0
NO2         0
O3          0
CO          0
SO2         0
Dust      188
F_dust    710
dtype: int64

In [6]:
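# Unsure which imputation method fits best; for now, fill the gaps by linear interpolation.
# Rows are ordered by year and then station, so this interpolates across neighbouring rows, not within one station.
year_cont_df['Dust'] = year_cont_df['Dust'].interpolate(method='linear')
year_cont_df['F_dust'] = year_cont_df['F_dust'].interpolate(method='linear')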
year_cont_df.isnull().sum()

Year      0
Where     0
NO2       0
O3        0
CO        0
SO2       0
Dust      0
F_dust    0
dtype: int64
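
One alternative to column-wise interpolation is to interpolate within each station along the year axis, so that a gap is filled only from the same station's neighbouring years. The following is a minimal sketch on made-up numbers; the `toy` frame, its station names, and the `Dust_filled` column are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stations, three years each, one gap per station
toy = pd.DataFrame({
    "Year":  [2023, 2023, 2022, 2022, 2021, 2021],
    "Where": ["A", "B", "A", "B", "A", "B"],
    "Dust":  [40.0, 60.0, np.nan, np.nan, 50.0, 80.0],
})

# Sort chronologically inside each station, then interpolate per station
toy = toy.sort_values(["Where", "Year"])
toy["Dust_filled"] = (
    toy.groupby("Where")["Dust"]
       .transform(lambda s: s.interpolate(method="linear"))
)
print(toy)
```

Whether this is preferable depends on how many consecutive years are missing per station; long gaps at the start of a series stay unfilled with this approach.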

In [7]:
import chardet
park_cont = 'EDA_data/자치구별 공원율 통계.csv'
with open(park_cont, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}

In [8]:
park = pd.read_csv(park_cont, encoding='utf-8-sig', header=1)  # use the detected encoding (UTF-8 with BOM)
park.drop('자치구별(1)', axis=1, inplace=True)
park

Unnamed: 0,자치구별(2),행정구역면적,공원면적,자연공원(국립)면적,도시자연공원구역면적
0,종로구,23912936,11404091,4998000,3315969
1,중구,9960262,3178194,-,1790965
2,용산구,21866145,1775550,-,54875
3,성동구,16859899,3074323,-,72648
4,광진구,17062995,3459842,-,2097470
5,동대문구,14215806,1216012,-,96500
6,중랑구,18495584,5232976,-,2895864
7,성북구,24576989,8491596,3864000,2899316
8,강북구,23600441,14383641,11899000,412374
9,도봉구,20651004,10160375,8703000,335258


In [9]:
park.dtypes

자치구별(2)       object
행정구역면적        object
공원면적           int64
자연공원(국립)면적    object
도시자연공원구역면적    object
dtype: object

In [10]:
park.drop(25, inplace=True)
park

Unnamed: 0,자치구별(2),행정구역면적,공원면적,자연공원(국립)면적,도시자연공원구역면적
0,종로구,23912936,11404091,4998000,3315969
1,중구,9960262,3178194,-,1790965
2,용산구,21866145,1775550,-,54875
3,성동구,16859899,3074323,-,72648
4,광진구,17062995,3459842,-,2097470
5,동대문구,14215806,1216012,-,96500
6,중랑구,18495584,5232976,-,2895864
7,성북구,24576989,8491596,3864000,2899316
8,강북구,23600441,14383641,11899000,412374
9,도봉구,20651004,10160375,8703000,335258


In [11]:
park['행정구역면적'] = park['행정구역면적'].astype(str).astype(int)
park.dtypes

자치구별(2)       object
행정구역면적         int64
공원면적           int64
자연공원(국립)면적    object
도시자연공원구역면적    object
dtype: object
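
Columns that remain `object`, such as 자연공원(국립)면적, contain `-` where no value is recorded. If those columns are needed later, one option is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a hypothetical three-row frame (`toy_park` and its column names are illustrative), and reading `-` as zero area is an assumption about what the placeholder means.

```python
import pandas as pd

# Hypothetical miniature of the park table; '-' marks a missing entry
toy_park = pd.DataFrame({
    "district": ["A", "B", "C"],
    "national_park_area": ["4998000", "-", "3864000"],
})

# Coerce non-numeric strings to NaN, then (assumption) read NaN as zero area
toy_park["national_park_area"] = (
    pd.to_numeric(toy_park["national_park_area"], errors="coerce")
      .fillna(0)
      .astype("int64")
)
print(toy_park.dtypes)
```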

In [12]:
park['Park ratio'] = park['공원면적']/park['행정구역면적']
park

Unnamed: 0,자치구별(2),행정구역면적,공원면적,자연공원(국립)면적,도시자연공원구역면적,Park ratio
0,종로구,23912936,11404091,4998000,3315969,0.4769
1,중구,9960262,3178194,-,1790965,0.319087
2,용산구,21866145,1775550,-,54875,0.081201
3,성동구,16859899,3074323,-,72648,0.182345
4,광진구,17062995,3459842,-,2097470,0.202769
5,동대문구,14215806,1216012,-,96500,0.085539
6,중랑구,18495584,5232976,-,2895864,0.282931
7,성북구,24576989,8491596,3864000,2899316,0.34551
8,강북구,23600441,14383641,11899000,412374,0.609465
9,도봉구,20651004,10160375,8703000,335258,0.492004


In [13]:
park.sort_values(by = 'Park ratio', ascending = False, inplace = False)

Unnamed: 0,자치구별(2),행정구역면적,공원면적,자연공원(국립)면적,도시자연공원구역면적,Park ratio
8,강북구,23600441,14383641,11899000,412374,0.609465
9,도봉구,20651004,10160375,8703000,335258,0.492004
11,은평구,29710522,14368192,7663000,4406372,0.483606
0,종로구,23912936,11404091,4998000,3315969,0.4769
20,관악구,29568314,12484990,-,10418474,0.422242
10,노원구,35439209,14742942,-,12798824,0.416007
7,성북구,24576989,8491596,3864000,2899316,0.34551
21,서초구,46981621,15072028,-,12337229,0.320807
1,중구,9960262,3178194,-,1790965,0.319087
12,서대문구,17626389,5098922,218000,3661124,0.289278


In [17]:
group_2 = year_cont_df.groupby('Where')
mean = group_2.mean()
mean.drop('Year', axis = 1, inplace = True)
mean.sort_values('Dust', ascending=False)

Unnamed: 0_level_0,NO2,O3,CO,SO2,Dust,F_dust
Where,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
천호대로,0.040167,0.017039,0.805066,0.007965,70.536087,31.882353
구로구2,0.029957,0.014609,1.378261,0.022261,70.207852,36.0
홍지문,0.045159,0.012348,1.267241,0.009533,63.2,34.9
한강대로,0.045673,0.012813,1.005364,0.006098,60.777778,31.037037
홍릉로,0.051138,0.017232,1.075862,0.007136,60.413448,32.896552
청계천로,0.051129,0.014412,0.996774,0.00859,60.19957,31.645161
송파구2,0.034581,0.013387,1.156071,0.015796,60.064111,36.0
신촌로,0.056323,0.011194,1.112903,0.009323,58.903656,32.516129
영등포로,0.055032,0.012864,1.103226,0.00759,58.897419,32.516129
성동구,0.033946,0.016378,1.051351,0.017568,58.3924,31.899614


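At the district (구) level, the three districts with the highest park ratio were selected and their fine dust data (the `Dust` column) collected: 강북구, 도봉구, and 은평구.
The annual fine dust data was likewise collected for the three districts with the lowest park ratio in the park data (강서구, 동대문구, and 용산구), and the difference between the two groups was then to be tested with a t-test.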

In [19]:
kb_df = year_cont_df[year_cont_df['Where']=='강북구']
db_df = year_cont_df[year_cont_df['Where']=='도봉구']
ep_df = year_cont_df[year_cont_df['Where']=='은평구']
ks_df = year_cont_df[year_cont_df['Where']=='강서구']
dd_df = year_cont_df[year_cont_df['Where']=='동대문구']
ys_df = year_cont_df[year_cont_df['Where']=='용산구']
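
The six filters above can also be written with `isin`, which keeps each group's definition in one place. A minimal sketch on hypothetical data (station names `A` to `D`, the lists `high_park` and `low_park`, and all values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Where": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "Dust":  [40.0, 42.0, 55.0, 57.0, 44.0, 46.0, 59.0, 61.0],
})

high_park = ["A", "B"]   # hypothetical high-park-ratio districts
low_park = ["C", "D"]    # hypothetical low-park-ratio districts

# One boolean mask per group, then pull the Dust values out as NumPy arrays
high_vals = toy.loc[toy["Where"].isin(high_park), "Dust"].to_numpy()
low_vals = toy.loc[toy["Where"].isin(low_park), "Dust"].to_numpy()
print(high_vals, low_vals)
```

The values come out in the frame's row order instead of district by district, which does not matter for the tests used in this notebook.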

In [37]:
kb_arr = kb_df['Dust'].to_numpy()
db_arr = db_df['Dust'].to_numpy()
ep_arr = ep_df['Dust'].to_numpy()

In [40]:
high_arr = np.concatenate([kb_arr,db_arr,ep_arr])

In [42]:
low_arr = np.concatenate([ks_df['Dust'].to_numpy(),dd_df['Dust'].to_numpy(),ys_df['Dust'].to_numpy()])
low_arr

array([56.        , 37.        , 41.        , 38.        , 47.        ,
       39.        , 47.        , 51.        , 48.        , 46.        ,
       46.        , 44.        , 50.        , 49.        , 57.        ,
       57.        , 61.        , 65.        , 62.        , 61.        ,
       68.        , 75.        , 66.        , 57.        , 56.        ,
       58.        , 60.        , 67.        , 68.        , 86.9       ,
       77.66      , 71.36      , 71.33333333, 43.        , 34.        ,
       38.        , 33.        , 40.        , 36.        , 44.        ,
       49.        , 44.        , 46.        , 45.        , 43.        ,
       49.        , 52.        , 53.        , 49.        , 63.        ,
       65.        , 47.        , 63.        , 64.        , 74.        ,
       73.        , 58.        , 66.        , 64.        , 79.        ,
       91.        , 97.        , 84.38      , 41.        , 33.        ,
       37.        , 36.        , 35.        , 34.        , 39.  

In [44]:
import scipy.stats as stats

shapiro_test1 = stats.shapiro(high_arr)
shapiro_test1

ShapiroResult(statistic=0.9467652440071106, pvalue=0.0008510627667419612)

In [45]:
shapiro_test2 = stats.shapiro(low_arr)
shapiro_test2

ShapiroResult(statistic=0.9715260863304138, pvalue=0.028961440548300743)

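Both Shapiro-Wilk p-values are below 0.05 (about 0.0009 and 0.029), so normality is rejected for both groups and the normality assumption of the t-test is not met. A nonparametric test is therefore needed.
The Mann-Whitney U test was used as the nonparametric test. The result is shown below: there is no statistically significant difference in fine dust concentration between the two groups.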

In [53]:
stats.mannwhitneyu(high_arr, low_arr)

MannwhitneyuResult(statistic=4377.0, pvalue=0.24103575667289612)
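
To make the statistic above less of a black box, here is a minimal pure-Python sketch of how a U statistic can be computed from pooled ranks, with tied values receiving the average of their ranks. The function name `mann_whitney_u` and the sample values are hypothetical; this illustrates the textbook definition and is not a replacement for `stats.mannwhitneyu`, which also computes the p-value.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) from pooled ranks, using average ranks for ties."""
    # Pool both samples, remembering which group each value came from
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i]
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(x), len(y)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    return u_x, n1 * n2 - u_x


u_x, u_y = mann_whitney_u([1, 2, 3], [2, 4, 5])
print(u_x, u_y)
```

Here `u_x` counts the pairs in which a value from the first sample exceeds one from the second, with each tie counted as one half.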

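For reference, the results under an assumed normality are shown below. A test for equal variances is run first, followed by the t-test. Because the equal-variance assumption is satisfied (Bartlett p ≈ 0.50), the t-test is run with `equal_var=True`.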

In [47]:
stats.bartlett(high_arr, low_arr)

BartlettResult(statistic=0.4615670033158464, pvalue=0.4968925413197923)

In [50]:
t_stat, p_value = stats.ttest_ind(high_arr, low_arr, equal_var=True)
print("t-statistics : {}, p-value : {}".format(t_stat, p_value))

t-statistics : -0.6230341615485309, p-value : 0.5340052849336542
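
For readers who want to see what `equal_var=True` implies, the following is a minimal sketch of the pooled-variance (Student) t statistic using only the standard library. The function name `pooled_t` and the sample values are hypothetical; the sketch returns the statistic and the degrees of freedom but not the p-value.

```python
import math
import statistics


def pooled_t(x, y):
    """Two-sample t statistic under the equal-variance assumption."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom


t, df = pooled_t([1, 2, 3], [2, 3, 4])
print(t, df)
```

A negative value, as in the notebook's result, simply means the first group's mean is lower than the second's.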


The p-value is greater than 0.05, so the null hypothesis cannot be rejected. In other words, the means of the two groups cannot be said to differ significantly.