### Data normalization (file: INA_Normalized)
(For each channel, we take the subjects of each month and compute the proportion that falls in each theme, so that the channels can be compared with one another.)

There are probably smarter and less laborious ways to do this normalization, but this is the one that was used.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read the source file

In [8]:
df=pd.read_csv('INA.csv', sep=';', encoding='latin-1') # the original file

#### Drop the last 6 months so that only complete years are compared

In [4]:
df2=df[:2183]

### Sum the subjects for each channel and each month (the normalization constants)

In [6]:
norm=[]
for column in df[['TF1', 'France 2', 'France 3', 'Canal +', 'Arte', 'M6']]:
    norm.append(df2.groupby(['MOIS'])[column].sum())
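The same monthly totals can probably be obtained with a single `groupby` over all channel columns at once. A minimal sketch on a made-up table (the values and the reduced channel list are illustrative assumptions, not INA data):

```python
import pandas as pd

# Toy stand-in for the INA table: 2 months x 2 themes, 3 channels (made-up values).
toy = pd.DataFrame({
    "MOIS": ["janv.-05", "janv.-05", "févr.-05", "févr.-05"],
    "THEMATIQUES": ["Economie", "Justice", "Economie", "Justice"],
    "TF1": [3.0, 1.0, 2.0, 2.0],
    "France 2": [4.0, 4.0, 1.0, 3.0],
    "Arte": [0.0, 5.0, 2.0, 0.0],
})
channels = ["TF1", "France 2", "Arte"]

# One row per month, one column per channel: the normalization constants.
# groupby sorts the month labels lexicographically, just like the loop above.
monthly_totals = toy.groupby("MOIS")[channels].sum()
print(monthly_totals)
```

Note that the result is indexed by the month labels in lexicographic order, which is why a reordering step is still needed afterwards.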

### Put everything back in order (from lexicographic order, i.e. août, avril, ..., to calendar order)

In [7]:
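# l[i] = position of calendar month i (January..December) among the lexicographically sorted month names
l=[4,3,8,1,7,6,5,0,11,10,9,2]
new_idx=[]
for i in range(12):
    # each month name occupies 13 consecutive entries, so its block starts at 13*l[i]
    new_idx.append(l[i]*12+l[i])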

In [8]:
new_idx

[52, 39, 104, 13, 91, 78, 65, 0, 143, 130, 117, 26]

In [12]:
new_idx_full=np.arange(52,52+12+1)  # first block: the 13 entries starting at new_idx[0]
k=12  # each block spans k+1 = 13 entries
for i in range(11):
    new_idx_full=np.concatenate((new_idx_full,np.arange(new_idx[i+1],new_idx[i+1]+k+1)))

In [18]:
new_idx_full

array([ 52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38])
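The permutation array above can also be built without an explicit loop, by adding the offsets 0..12 to each block start with broadcasting. A minimal sketch; the names `month_rank`, `n_years` and `compact_idx` are introduced here for illustration only:

```python
import numpy as np

# Position of each calendar month (January..December) among the
# lexicographically sorted month names, as in the list `l` above.
month_rank = np.array([4, 3, 8, 1, 7, 6, 5, 0, 11, 10, 9, 2])
n_years = 13  # entries per month name, matching the blocks of 13 above

# Block start for each month, plus the offsets 0..12 inside the block.
starts = month_rank * n_years
compact_idx = (starts[:, None] + np.arange(n_years)[None, :]).ravel()

# Same result as concatenating one arange per month.
loop_idx = np.concatenate([np.arange(s, s + n_years) for s in starts])
assert np.array_equal(compact_idx, loop_idx)
print(compact_idx[:13])
```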

#### Now that we have the full permutation array, we can put everything back in order.

In [15]:
norm_array=np.array(norm)

In [21]:
norm_array_reorder=norm_array[:,new_idx_full]

#### The dimensions still have to be changed, taking care not to mix up the axes, so that the two arrays can be divided element by element

In [24]:
norm_tensor=np.reshape(norm_array_reorder, (6,12,13)) # one array per channel, with the months in order

In [25]:
norm_tensor=np.swapaxes(norm_tensor,1,2)

In [28]:
norm_final=np.repeat(norm_tensor.flatten(), 14)

In [29]:
norm_final=norm_final.reshape((6,len(norm_final)//6))

In [30]:
norm_final=norm_final.transpose()

In [31]:
norm_final

array([[705., 666., 390., 105., 254., 279.],
       [705., 666., 390., 105., 254., 279.],
       [705., 666., 390., 105., 254., 279.],
       ...,
       [450., 495., 369.,  89., 250., 375.],
       [450., 495., 369.,  89., 250., 375.],
       [450., 495., 369.,  89., 250., 375.]])
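To see what the reshape, `swapaxes`, `repeat` and transpose steps do to the axes, it can help to run them on a tiny array whose values encode their own position. A sketch with assumed toy sizes (2 channels, 3 months, 2 years, 2 themes) rather than the real 6, 12, 13 and 14:

```python
import numpy as np

n_channels, n_months, n_years, n_themes = 2, 3, 2, 2  # toy sizes (assumed)

# Toy constants laid out like norm_array_reorder: for each channel,
# months in calendar order and, within each month, the years in order.
# Each value encodes its own position: 100*channel + 10*year + month.
reordered = np.array([[100 * c + 10 * y + m
                       for m in range(n_months) for y in range(n_years)]
                      for c in range(n_channels)])

tensor = reordered.reshape(n_channels, n_months, n_years)
tensor = np.swapaxes(tensor, 1, 2)                # -> (channel, year, month)
rows = np.repeat(tensor.flatten(), n_themes)      # one copy per theme
rows = rows.reshape(n_channels, -1).transpose()   # -> (row, channel)
print(rows[:, 0])
```

In the first column the values run through year 0, months 0 to 2, then year 1, months 0 to 2, each repeated once per theme, which is the chronological row order of the data table.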

### Perform the normalization

In [33]:
data=np.array(df2[['TF1', 'France 2', 'France 3', 'Canal +', 'Arte', 'M6']])

In [34]:
data

array([[214., 191.,  88.,  18.,  40.,  49.],
       [ 27.,  42.,  35.,   4.,   0.,  23.],
       [ 35.,  18.,  10.,   1.,   8.,  11.],
       ...,
       [ 17.,  17.,  20.,   0.,   3.,  16.],
       [  7.,   6.,   3.,   5.,  12.,  10.],
       [ 87.,  82.,  67.,  11.,  28.,  77.]])

In [35]:
data_normalized=np.divide(data, norm_final[:-1])



In [36]:
data_normalized

array([[0.3035461 , 0.28678679, 0.22564103, 0.17142857, 0.15748031,
        0.17562724],
       [0.03829787, 0.06306306, 0.08974359, 0.03809524, 0.        ,
        0.08243728],
       [0.04964539, 0.02702703, 0.02564103, 0.00952381, 0.03149606,
        0.03942652],
       ...,
       [0.03777778, 0.03434343, 0.05420054, 0.        , 0.012     ,
        0.04266667],
       [0.01555556, 0.01212121, 0.00813008, 0.05617978, 0.048     ,
        0.02666667],
       [0.19333333, 0.16565657, 0.18157182, 0.12359551, 0.112     ,
        0.20533333]])
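One possibly less laborious route to the same kind of result is `groupby(...).transform("sum")`, which returns the monthly total already aligned with every original row, so no reordering or reshaping is needed. This is a sketch on made-up data, not the method used in this notebook:

```python
import pandas as pd

# Toy stand-in for the INA table (made-up values, reduced channel list).
toy = pd.DataFrame({
    "MOIS": ["janv.-05", "janv.-05", "févr.-05", "févr.-05"],
    "THEMATIQUES": ["Economie", "Justice", "Economie", "Justice"],
    "TF1": [3.0, 1.0, 2.0, 2.0],
    "France 2": [4.0, 4.0, 1.0, 3.0],
    "Arte": [0.0, 5.0, 2.0, 0.0],
})
channels = ["TF1", "France 2", "Arte"]

# Monthly total per channel, repeated on every row of that month.
totals_per_row = toy.groupby("MOIS")[channels].transform("sum")

# Element-wise division: each value becomes the theme's share of the month.
toy_normalized = toy[channels] / totals_per_row
print(toy_normalized)
```

A month in which a channel has no subjects at all would give 0/0, i.e. `NaN`, with this approach.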

#### Repackage into a DataFrame

In [37]:
df_norm=pd.DataFrame(data=data_normalized, columns=['TF1', 'France 2', 'France 3', 'Canal +', 'Arte', 'M6'])

In [38]:
df2_norm=df2

In [353]:
df2_norm['Totaux']=df['TF1']+df['France 2']+df['France 3']+df['Canal +']+ df['Arte']+ df['M6'] # each value is the theme's share of the subjects covered that month by each TV channel

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
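The warning above appears because a new column is assigned on a DataFrame that was obtained by slicing another one. A common way to avoid it is to take an explicit copy before adding columns. A minimal sketch on made-up data (the names `toy` and `toy_part` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"TF1": [1.0, 2.0, 3.0], "M6": [4.0, 5.0, 6.0]})

# Explicit copy: the new column is written to an independent DataFrame,
# not to a slice that still refers back to `toy`.
toy_part = toy[:2].copy()
toy_part["Totaux"] = toy_part["TF1"] + toy_part["M6"]
print(toy_part)
```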


##### Sanity check: for each channel, the proportions over the 14 themes of a month sum to 1, and there are 6 channels, so the total should be 6.

In [356]:
np.sum(np.array(df2_norm['Totaux'][:14]))#YEEEEEEEES

6.000000000000001
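Because of floating-point rounding the total comes out as 6.000000000000001 rather than exactly 6, so a check of this kind is safer with a tolerance. A minimal sketch on made-up proportions (3 themes and 2 channels, assumed for illustration):

```python
import numpy as np

# Made-up proportions for one month: rows are themes, columns are channels.
props = np.array([[0.5, 0.2],
                  [0.3, 0.3],
                  [0.2, 0.5]])

# Each channel's proportions should sum to 1 over the themes.
per_channel = props.sum(axis=0)
channels_ok = np.allclose(per_channel, 1.0)

# The grand total should then equal the number of channels.
total_ok = np.isclose(props.sum(), props.shape[1])
print(channels_ok, total_ok)
```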

In [40]:
df2_norm

Unnamed: 0,MOIS,THEMATIQUES,TF1,France 2,France 3,Canal +,Arte,M6,Totaux
0,janv.-05,Catastrophes,0.303546,0.286787,0.225641,0.171429,0.157480,0.175627,600
1,janv.-05,Culture-loisirs,0.038298,0.063063,0.089744,0.038095,0.000000,0.082437,131
2,janv.-05,Economie,0.049645,0.027027,0.025641,0.009524,0.031496,0.039427,83
3,janv.-05,Education,0.019858,0.018018,0.010256,0.028571,0.011811,0.028674,44
4,janv.-05,Environnement,0.043972,0.037538,0.038462,0.009524,0.011811,0.035842,85
5,janv.-05,Faits divers,0.034043,0.028529,0.023077,0.009524,0.003937,0.050179,68
6,janv.-05,Histoire-hommages,0.051064,0.057057,0.061538,0.095238,0.090551,0.068100,150
7,janv.-05,International,0.073759,0.091592,0.138462,0.142857,0.425197,0.093190,316
8,janv.-05,Justice,0.028369,0.040541,0.041026,0.104762,0.043307,0.025090,92
9,janv.-05,Politique France,0.038298,0.037538,0.064103,0.057143,0.086614,0.028674,113


### Save to pkl format

In [5]:
#df2_norm.to_pickle("./INA_normalized.pkl")