### How to normalize a numeric column within each group

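Benefits of normalization: 1. values become easier to compare across groups; 2. machine learning models often train faster and perform better.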

In [1]:
import pandas as pd

In [2]:
ratings = pd.read_csv(
    "C:/Users/THE KEY/Desktop/python_datum/pandas/data/myself_data/movies/ratings.csv",
    sep = ",",
    engine = "python"
)

In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
# drop=True removes userId from the columns, so it serves only as the index
ratings.set_index("userId",inplace = True,drop = True)

In [5]:
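# Find the first user whose ratings are all identical (max == min).
# User 53 gave every movie 5.0, so min-max normalization would divide by zero;
# that user's normalized rating is set to 1 by hand in the next cell.
for user_id, user_ratings in ratings.groupby(level="userId")["rating"]:
    if user_ratings.max() == user_ratings.min():
        print(user_id)
        break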

53
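
The loop above stops at the first user with identical ratings, so it cannot tell whether other such users exist. A minimal sketch of listing all of them at once, assuming a toy table with the same column names (the values and the names `toy`, `distinct`, and `constant_users` are hypothetical, not taken from the real data):

```python
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values, same column names)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 5.0, 5.0, 3.0],
})

# Number of distinct ratings per user; 1 means max == min for that user
distinct = toy.groupby("userId")["rating"].nunique()
constant_users = distinct[distinct == 1].index.tolist()
print(constant_users)  # [2, 3]
```

Note that a user with a single rating also counts as constant, whatever the value of that rating.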


In [6]:
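# Group by user ID and min-max normalize each user's ratings:
# (rating - min_value) / (max_value - min_value)
def ratings_norm(df):
    min_value = df["rating"].min()
    max_value = df["rating"].max()
    if max_value != min_value:
        # Vectorized arithmetic on the column; no per-row lambda needed
        df["rating_norm"] = (df["rating"] - min_value) / (max_value - min_value)
    else:
        # All ratings are identical (e.g. user 53): the formula would divide
        # by zero, so assign 1 by hand
        df["rating_norm"] = 1.0
    return df
ratings = ratings.groupby(ratings.index).apply(ratings_norm)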

In [7]:
ratings

userId,userId,movieId,rating,timestamp,rating_norm
1,1,1,4.0,964982703,0.750000
1,1,3,4.0,964981247,0.750000
1,1,6,4.0,964982224,0.750000
1,1,47,5.0,964983815,1.000000
1,1,50,5.0,964982931,1.000000
...,...,...,...,...,...
610,610,166534,4.0,1493848402,0.777778
610,610,168248,5.0,1493850091,1.000000
610,610,168250,5.0,1494273047,1.000000
610,610,168252,5.0,1493846352,1.000000


In [8]:
# The special case: user 53
ratings.loc[53,'rating_norm'].head()

userId
53    1.0
53    1.0
53    1.0
53    1.0
53    1.0
Name: rating_norm, dtype: float64
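
As an alternative to `groupby(...).apply(...)`, the same per-user min-max normalization can be sketched with `groupby(...).transform(...)`, which returns values aligned to the original rows and so keeps the original index instead of adding a second `userId` level. The toy data and the names `toy`, `group_min`, and `group_range` below are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values)
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "rating": [1.0, 3.0, 5.0, 5.0, 5.0],
})

grouped = toy.groupby("userId")["rating"]
group_min = grouped.transform("min")                # per-row minimum of the row's user
group_range = grouped.transform("max") - group_min  # per-row (max - min) of the row's user

# Where the range is 0 the division yields NaN; replace those rows with 1.0,
# mirroring the manual choice made for user 53
normalized = (toy["rating"] - group_min) / group_range
toy["rating_norm"] = normalized.where(group_range != 0, 1.0)
print(toy)
```

Here user 1 gets 0.0, 0.5, and 1.0, while user 2, whose ratings are all 5.0, gets 1.0 on every row.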

### ----------------------------------------------------------------------

### How to get the top N rows of each group

In [9]:
weather_path = "C:/Users/THE KEY/Desktop/python_datum/pandas/data/weather/weather_beijing.xlsx"
wea = pd.read_excel(weather_path)
# Strip the degree sign from the daily high (最高温) and convert it to int
wea['最高温'] = wea['最高温'].str.replace("°", "").astype(int)
# Strip the degree sign from the daily low (最低温); it stays a string for now
wea['最低温'] = wea['最低温'].str.strip("°")
# Overwrite rows 0 and 455 by hand before converting the column to int
wea.loc[[0], "最低温"] = '-7'
wea.loc[[455], "最低温"] = '0'
wea['最低温'] = wea['最低温'].astype(int)
# wea.set_index('日期', inplace= True)
wea.head()

Unnamed: 0,日期,最高温,最低温,天气,风力风向,空气质量指数
0,2011-01-01 周六,-2,-7,多云~阴,无持续风向微风,
1,2011-01-02 周日,-2,-7,多云,无持续风向微风,
2,2011-01-03 周一,-2,-6,多云~阴,西北风~北风3-4级~4-5级,
3,2011-01-04 周二,-2,-9,晴,北风5-6级,
4,2011-01-05 周三,-2,-10,晴,北风~无持续风向3-4级~微风,


In [10]:
wea["month"] = wea["日期"].str[:7]
wea.head()

Unnamed: 0,日期,最高温,最低温,天气,风力风向,空气质量指数,month
0,2011-01-01 周六,-2,-7,多云~阴,无持续风向微风,,2011-01
1,2011-01-02 周日,-2,-7,多云,无持续风向微风,,2011-01
2,2011-01-03 周一,-2,-6,多云~阴,西北风~北风3-4级~4-5级,,2011-01
3,2011-01-04 周二,-2,-9,晴,北风5-6级,,2011-01
4,2011-01-05 周三,-2,-10,晴,北风~无持续风向3-4级~微风,,2011-01
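
Slicing the first 7 characters works because every date string starts with `YYYY-MM`. A slightly stricter sketch, assuming the same `YYYY-MM-DD 周X` layout seen in the output above, parses the date first so that a malformed entry raises an error instead of silently producing a wrong month (the names `dates`, `parsed`, and `month` are hypothetical):

```python
import pandas as pd

# Sample strings in the same layout as the 日期 column
dates = pd.Series(["2011-01-01 周六", "2011-01-31 周一", "2011-02-19 周六"])

# The first 10 characters hold the ISO date; the weekday suffix is dropped
parsed = pd.to_datetime(dates.str[:10], format="%Y-%m-%d")

# Format back to the same "YYYY-MM" strings used as the grouping key
month = parsed.dt.strftime("%Y-%m")
print(month.tolist())  # ['2011-01', '2011-01', '2011-02']
```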


In [11]:
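# [-topn:] is slice syntax: after the ascending sort it keeps the last topn rows, i.e. the topn highest temperatures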
def tem_top_N(df,topn):
    return df.sort_values(by="最高温")[["日期","最高温"]][-topn:]
wea.groupby("month").apply(tem_top_N, topn = 3).head(9)


month,,日期,最高温
2011-01,7,2011-01-08 周六,0
2011-01,18,2011-01-30 周日,3
2011-01,19,2011-01-31 周一,7
2011-02,42,2011-02-23 周三,9
2011-02,38,2011-02-19 周六,10
2011-02,43,2011-02-24 周四,11
2011-03,76,2011-03-29 周二,19
2011-03,77,2011-03-30 周三,22
2011-03,78,2011-03-31 周四,22
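
The same top-N selection can also be sketched without `apply`: sort the whole table once in descending order, then take the first N rows of each group with `groupby(...).head(N)`. The toy data and the names `toy` and `top2` below are assumptions for illustration; only the column names match the weather table.

```python
import pandas as pd

# Toy stand-in for the weather table (hypothetical values)
toy = pd.DataFrame({
    "month": ["2011-01"] * 4 + ["2011-02"] * 3,
    "最高温": [-2, 0, 3, 7, 9, 10, 11],
})

top2 = (
    toy.sort_values("最高温", ascending=False)  # hottest days first
       .groupby("month")
       .head(2)                                 # first 2 rows of each month
       .sort_values(["month", "最高温"])        # tidy order for display
)
print(top2)
```

Unlike the `apply` version, this keeps the original row index and the `month` column rather than building a two-level index.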
