# Drop variables with a range of 0
The correlation analysis showed that some variables cannot be correlated because they contain only the value 0. They have no range or variance, so they carry no information that a model could use. These variables are therefore dropped.
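As a minimal illustration of why such columns cannot be correlated, the sketch below uses toy data rather than the SonarQube measures (the column names `a`, `b` and `const` are made up). Pearson correlation divides by the standard deviation of each variable, so a zero-variance column yields NaN against every column, including itself.

```python
import pandas as pd

# Toy frame: 'const' holds the same value in every row
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "const": [0, 0, 0, 0],
})

corr = toy.corr()

# Columns whose correlations with all columns are NaN
nan_cols = corr.columns[corr.isnull().all()].tolist()
print(nan_cols)
```

Only `const` should end up in `nan_cols`; the correlation between `a` and `b` is still computed normally.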

In [2]:
import pandas as pd
import os

In [3]:
# data import
current_dir = os.getcwd()

# construct path to the project data folder
data_dir = os.path.join(current_dir, '..', '..', 'Data','Sonar_Measures')

# load SonarQube measure data (without duplicates)
df = pd.read_csv(os.path.join(data_dir, 'sonar_measures_v1_v2_usable_vars.csv'), low_memory=False)
df

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,FUNCTIONS,COMMENT_LINES,COMMENT_LINES_DENSITY,COMPLEXITY,FILE_COMPLEXITY,CLASS_COMPLEXITY,...,NEW_SQALE_DEBT_RATIO,VULNERABILITIES,RELIABILITY_REMEDIATION_EFFORT,RELIABILITY_RATING,SECURITY_REMEDIATION_EFFORT,SECURITY_RATING,WONT_FIX_ISSUES,PACKAGE_DEPENDENCY_CYCLES,database,DIRECTORIES
0,accumulo,2008-07-07 14:52:05,2108.0,1103.0,17295.0,13509.0,6.2,43137.0,40.6,20.4,...,0.000000,838,7322,5,9505,4,0,0,Version1,
1,accumulo,2008-07-07 12:31:47,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,0.222222,838,7081,5,9505,4,0,0,Version1,
2,accumulo,2008-07-05 18:54:27,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,0.222222,838,7081,5,9505,4,0,0,Version1,
3,accumulo,2008-07-03 20:21:40,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,0.674560,838,7322,5,9505,4,0,0,Version1,
4,accumulo,2008-07-02 00:12:36,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,0.671668,838,7322,5,9505,4,0,0,Version1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140743,vfs,2002-08-20 06:10:50,69.0,65.0,425.0,1536.0,24.2,690.0,10.6,10.0,...,2.025316,1,0,1,30,4,0,0,Version2,14.0
140744,vfs,2002-08-20 02:57:02,69.0,65.0,422.0,1533.0,24.1,693.0,10.7,10.0,...,2.184236,1,0,1,30,4,0,0,Version2,14.0
140745,vfs,2002-07-19 11:54:15,69.0,65.0,421.0,1513.0,24.1,687.0,10.6,10.0,...,2.613333,1,0,1,30,4,0,0,Version2,14.0
140746,vfs,2002-07-18 16:47:24,69.0,65.0,421.0,1513.0,24.1,687.0,10.6,10.0,...,2.613333,1,0,1,30,4,0,0,Version2,14.0


In [6]:
# find variables that can't be correlated
df_num = df.select_dtypes(include='number')
corr_matrix = df_num.corr()
uncorrelated_cols = corr_matrix.columns[corr_matrix.isnull().all()].tolist()
print(f"Variables for which correlation calculation isn't possible: {uncorrelated_cols}")

Variables for which correlation calculation isn't possible: ['COVERAGE', 'FALSE_POSITIVE_ISSUES', 'AFFERENT_COUPLINGS', 'EFFERENT_COUPLINGS', 'LINE_COVERAGE', 'NUMBER_OF_CLASSES_AND_INTERFACES', 'REOPENED_ISSUES', 'WONT_FIX_ISSUES', 'PACKAGE_DEPENDENCY_CYCLES']


In [7]:
# investigate range (max - min) of these variables
for col in uncorrelated_cols:
    min_val = df_num[col].min()
    max_val = df_num[col].max()
    print(f"Range {col}: {max_val - min_val}")

Range COVERAGE: 0.0
Range FALSE_POSITIVE_ISSUES: 0
Range AFFERENT_COUPLINGS: 0
Range EFFERENT_COUPLINGS: 0
Range LINE_COVERAGE: 0.0
Range NUMBER_OF_CLASSES_AND_INTERFACES: 0
Range REOPENED_ISSUES: 0
Range WONT_FIX_ISSUES: 0
Range PACKAGE_DEPENDENCY_CYCLES: 0
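
A range of 0 shows that a column is constant, but not which value it holds. The following is a minimal sketch of a possible follow-up check, written against toy data with made-up column names and a hypothetical helper `describe_constant`. It separates all-zero columns from other constant columns and from all-NaN columns, which would also produce NaN-only correlations but a NaN range.

```python
import numpy as np
import pandas as pd

# Toy data; column names are illustrative only
toy = pd.DataFrame({
    "all_zero": [0, 0, 0],
    "all_seven": [7, 7, 7],
    "all_nan": [np.nan, np.nan, np.nan],
    "varying": [1, 2, 3],
})

def describe_constant(series: pd.Series) -> str:
    """Classify a column by its distinct non-missing values."""
    values = series.dropna().unique()
    if len(values) == 0:
        return "all NaN"
    if len(values) == 1:
        return f"constant {values[0]}"
    return "varying"

summary = {col: describe_constant(toy[col]) for col in toy.columns}
for col, kind in summary.items():
    print(f"{col}: {kind}")
```

Note that missing values are ignored here, so a column holding one value plus some NaN entries is also reported as constant.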


Correlating the numerical variables leaves a set of variables for which no correlation can be computed. Each of them holds the value 0 in every row, so its variance is zero and the correlation coefficient is undefined. These variables are dropped.

In [8]:
df = df.drop(columns=uncorrelated_cols)

In [9]:
df

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,FUNCTIONS,COMMENT_LINES,COMMENT_LINES_DENSITY,COMPLEXITY,FILE_COMPLEXITY,CLASS_COMPLEXITY,...,QUALITY_GATE_DETAILS,QUALITY_PROFILES,NEW_SQALE_DEBT_RATIO,VULNERABILITIES,RELIABILITY_REMEDIATION_EFFORT,RELIABILITY_RATING,SECURITY_REMEDIATION_EFFORT,SECURITY_RATING,database,DIRECTORIES
0,accumulo,2008-07-07 14:52:05,2108.0,1103.0,17295.0,13509.0,6.2,43137.0,40.6,20.4,...,"{""level"":""ERROR"",""conditions"":[{""metric"":""bloc...","[{""key"":""css-sonar-way-41536"",""language"":""css""...",0.000000,838,7322,5,9505,4,Version1,
1,accumulo,2008-07-07 12:31:47,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,"{""level"":""ERROR"",""conditions"":[{""metric"":""bloc...","[{""key"":""css-sonar-way-41536"",""language"":""css""...",0.222222,838,7081,5,9505,4,Version1,
2,accumulo,2008-07-05 18:54:27,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,"{""level"":""ERROR"",""conditions"":[{""metric"":""bloc...","[{""key"":""css-sonar-way-41536"",""language"":""css""...",0.222222,838,7081,5,9505,4,Version1,
3,accumulo,2008-07-03 20:21:40,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,"{""level"":""ERROR"",""conditions"":[{""metric"":""bloc...","[{""key"":""css-sonar-way-41536"",""language"":""css""...",0.674560,838,7322,5,9505,4,Version1,
4,accumulo,2008-07-02 00:12:36,2108.0,1103.0,17295.0,13507.0,6.2,43137.0,40.6,20.4,...,"{""level"":""ERROR"",""conditions"":[{""metric"":""bloc...","[{""key"":""css-sonar-way-41536"",""language"":""css""...",0.671668,838,7322,5,9505,4,Version1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140743,vfs,2002-08-20 06:10:50,69.0,65.0,425.0,1536.0,24.2,690.0,10.6,10.0,...,"{""level"":""ERROR"";""conditions"":[{""metric"":""bloc...","[{""key"":""java-sonar-way-04122"";""language"":""jav...",2.025316,1,0,1,30,4,Version2,14.0
140744,vfs,2002-08-20 02:57:02,69.0,65.0,422.0,1533.0,24.1,693.0,10.7,10.0,...,"{""level"":""ERROR"";""conditions"":[{""metric"":""bloc...","[{""key"":""java-sonar-way-04122"";""language"":""jav...",2.184236,1,0,1,30,4,Version2,14.0
140745,vfs,2002-07-19 11:54:15,69.0,65.0,421.0,1513.0,24.1,687.0,10.6,10.0,...,"{""level"":""ERROR"";""conditions"":[{""metric"":""bloc...","[{""key"":""java-sonar-way-04122"";""language"":""jav...",2.613333,1,0,1,30,4,Version2,14.0
140746,vfs,2002-07-18 16:47:24,69.0,65.0,421.0,1513.0,24.1,687.0,10.6,10.0,...,"{""level"":""ERROR"";""conditions"":[{""metric"":""bloc...","[{""key"":""java-sonar-way-04122"";""language"":""jav...",2.613333,1,0,1,30,4,Version2,14.0


## Result
After removing the static numerical variables, the dataframe has 53 columns left.
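
The column count can also be checked in code rather than read off the display. A minimal sketch on toy data, where the frame `toy` and the list `static_cols` are stand-ins for `df` and `uncorrelated_cols`, confirms that exactly the listed columns were removed:

```python
import pandas as pd

# Toy stand-in for the measures dataframe
toy = pd.DataFrame({
    "PROJECT_ID": ["p1", "p1", "p2"],
    "FILES": [10, 12, 3],
    "STATIC_A": [0, 0, 0],
    "STATIC_B": [0.0, 0.0, 0.0],
})
static_cols = ["STATIC_A", "STATIC_B"]  # stand-in for uncorrelated_cols

n_before = toy.shape[1]
reduced = toy.drop(columns=static_cols)
n_after = reduced.shape[1]

# Exactly len(static_cols) columns are gone, and none of them remain
assert n_before - n_after == len(static_cols)
assert not set(static_cols) & set(reduced.columns)
print(f"{n_before} columns before, {n_after} after")
```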

In [None]:
# save the dataset without the static variables
df.to_csv(os.path.join(data_dir, 'sonar_measures_v1_v2_no_statics.csv'), index=False)